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In  this  thesis  we  examine  the  problem  of  modelling  data  base  contents  and 
data  placement  on  devices.  This  modelling  is  necessary  in  analytic  data  base 
performance  evaluation  studies  in  order  to  estimate  the  number  of  records  of  a 
file  that  have  to  be  retrieved  in  response  to  the  user(s)  requests,  as  well  as  the 
number  of  blocks  of  the  file  containing  these  records.  The  epu,  io,  and  telecom¬ 
munication  costs  of  the  system  are  directly  or  indirectly  expressed  in  terms  of 
these  quantities. 

We  first  show  that  certain  assumptions  used  for  modelling  data  base  con¬ 
tents,  data  placement  on  devices  and  user  requests  often  are  not  satisfied  in 
actual  data  base  environments,  and  that  they  may  lead  to  errors  in  model  pred¬ 
ictions.  Wc  examine  formally  implications  of  non-uniformity  and  dependencies 
of  attribute  values  in  data  base  design  and  data  base  performance  evaluation. 
Thereafter,  w^e  provide  more  detailed  modelling  techniques  based  on  a  multivari¬ 
ate  statistical  model,  and  we  demonstrate  their  use  in  improving  data  base  per¬ 


formance. 
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Chapter  1:  Introduction 


Large  data  bases  today  contain  hundreds  of  thousands  of  records.  In  such 
environments,  considerations  of  performance  are  particularly  important.  The 
performance  of  data  base  systems  however,  depends  on  a  large  number  of  inter¬ 
dependent  factors  [Sevcik  1981].  Some  of  these  factors  include  data  model,  the 
data  language,  the  storage  structure,  the  accessing  algorithms,  the  data  base 
contents,  the  data  placement  on  devices,  and  the  user  requests.  The  impact  of 
these  factors  on  data  base  performance  is  not  very  well  understood.  This  may 
lead  to  errors  in  modelling,  with  impact  on  data  base  design  and  data  base  sys¬ 
tem  performance  prediction. 

Several  data  base  performance  issues  have  been  examined  by  various 
researchers.  Among  them  are:  the  selection  of  suitable  strategies  for  accessing 
the  data  in  relational  data  base  environments  ([Selinger  et  al.  1979],  [Yao  1979], 
[Goodman  et  al.  1979]),  the  selection  of  a  set  of  indices  in  an  inverted  data  base 
environment  ([Schkolnick  1975],  [Hammer  and  Chan  1976],  [Anderson  and  Berra 
1977],  the  development  of  rnultiattribute  retrieval  file  structures  ([Cardenas 
1975],  [Bentley  1975],  [Rothnie  and  Lozano  1975],  [Aho  and  Ulman  1979],  [Lum 
1970],  [Liou  and  Yao  1977]),  and  data  base  performance  prediction  ([Sevcik 
1981],  [Gambino  and  Gcrritscn  1977],  [Teorcy  and  Das  1976],  [Teor  ey  and  Ober- 
lander  1978]).  Analytic  models  that  have  been  proposed  for  the  study  of  these 
problems  require  an  estimate  of  the  number  of  records  qualifying  in  a  quer)'',  the 
average  number  of  records  qualifying  in  a  set  of  queries,  the  number  of  blocks 
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containing  records  that  qualify  m  a  query,  and  the  average  number  of  blocks 
that  contain  records  qualifying  in  a  set  of  queries.  We  will  use  the  generic  name 
selectivities  to  refer  to  these  quantities.  The  performance  measures  are 
expressed  either  directly  or  indirectly  in  terms  of  selectivities. 

In  order  to  estimate  the  selectivities.  the  data  base  contents  and  the  data 
placement  on  devices,  as  well  as  the  queries  issued  by  the  user  population  have 
to  be  modelled.  Existing  models  of  data  base  performance  use  various  simple 
assumptions  to  model  data  base  contents,  data  placement  on  devices  and  user 
requests.  In  this  thesis  we  examine  the  impact  of  these  assumptions  on  data 
base  perform.ance  evaluation,  and  we  dem.onstrate  that  the  estimated  data  base 
performance  is  highly  sensitive  to  these  assumptions.  Finally,  we  provide  more 
detailed  estimates  of  selectivities  and  we  show  their  use  in  improving  data  base 
design  and  data  base  performance  prediction. 

In  Chapter  2  of  this  thesis,  we  present  the  data  base  environment.  In 
Chapter  3,  we  review  data  base  performance  evaluation  research  relevant  to  this 
thesis.  In  Chapter  4,  w'e  present  examples  of  common  large  data  bases  that  do 
not  satisfy  the  assumptions  usually  used  for  the  estimation  of  selectivities.  In 
Chapter  5,  we  demonstrate  that  large  errors  in  the  estimation  of  performance 
may  result  from  these  assumptions  and  we  show  that  these  estimates  are  often 
pessimistic.  We  then  discuss  the  implications  of  these  results  for  data  base 
design  and  data  base  performance  prediction.  In  chater  6,  we  present  a  mul¬ 
tivariate  statistical  model  for  the  estimation  of  the  number  of  records  qualifying 
in  queries.  We  show  that  better  approximations  of  the  number  of  qualifj'ing 
records  may  be  achieved  by  using  this  model.  In  Chapter  7,  we  present  a  new 
approach  for  the  estimation  of  the  number  of  blocks  containing  records  qualify¬ 
ing  in  queries.  In  Chapter  8,  we  show  applications  of  improved  approximations  in 
query  evaluation,  index  selection,  and  data  base  performance  predietion. 
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Finally,  in  Chapter  9  we  present  our  conclusions  and  directions  for  future 
research. 
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Chapter  2:  The  Bata  Base  Environment 


In  this  section,  we  will  describe  a  data  base  environment.  Our  de.scription 
includes  the  data  model,  the  query  language,  the  device  structures,  the  storage 
structures,  and  the  record  accessing  mechanisms. 

The  data  model 

The  relational  data  model  was  introduced  by  Codd  ([Codd  1970],  [Codd 
1971]),  and  is  described  in  detail  in  Date  [Date  1976],  and  Tsichritzis  and  Lochov- 
sky  [Tsichritzis  and  Lochovsky  1977].  The  user  of  the  relational  model  submits 
queries  in  a  query  language  asking  for  a  set  of  tuples  that  satisfy  the  query.  The 
data  base  management  system  evaluates  the  query  and  returns  to  the  user  the 
set  of  qualifying  tuples.  The  user  of  the  relational  model  does  not  have  to 
specify  to  the  system  how  to  access  the  data.  This,  in  addition  to  the  strong 
mathematical  basis  of  the  relational  model,  facilitates  the  development  of  high 
level  languages  that  provide  a  friendly  interface  for  the  user.  Moreover,  the 
underlying  storage  structures  can  be  changed  such  that  they  meet  the  new 
demands  of  the  user  population  without  affecting  the  user  programs.  However, 
since  the  user  does  not  specify  how  to  access  the  data,  the  system  must  be 
able  to  determine  a  good  accessing  strategy  for  a  particular  request.  Otherwise, 
the  response  time  may  be  needlessly  large.  This  situation  motivates  research 
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into  the  development  of  good  data  accessing  strategies  in  relational  data  base 
environments. 

The  qtiery  language 

We  assume  here  that  the  data  language  is  a  high  level  language  based  on  the 
relational  algebra  [Tsichritzis  and  Lochovsky  1977].  A  query  in  this  language  can 
be  decomposed  into  a  series  of  primitive  relational  algebra  operators  (join,  res¬ 
triction,  etc.),  interleaved  with  sei  operators  (union,  intersection  etc.).  Efficient 
processing  of  a  query  in  this  language  depends  on  which  primitive  operators  are 
involved  and  how  they  are  interleaved.  A  categorization  of  queries  into  groups 
with  a  certain  form  may  allow  efficient  accessing  strategies  to  be  developed  for 
each  group.  We  vidll  next  describe  a  categorization  of  queries  in  a  single  file 
environment  which  we  will  use  in  subsequent  chapters. 

An  attribute  condition  A  has  the  form:  B  ■^{JB2\J  '  '  '  \jBi^  where  the  B^'s  are 
conditions  on  the  same  attribute  and  have  the  form:  (Attr  Oper  Value)  where 
Attr  is  the  attribute  name,  Oper  is  one  of  <,  5,  =,  >,  >,  and  Value  is  a  value 
from  the  domain  of  values  associated  with  the  attribute.  B^'s  are  called  res¬ 
trictions  in  relational  terminology.  We  also  allow  Bi  to  be  of  the  form 
Attr —  \^x,  with  the  interpretation  that  the  attribute  takes  any  value  in  the 
subdomain  of  which  starts  with  x  (included)  and  ends  with  x-HAa: 

(included). 

Examples: 

(Z:IA'P#=3)U(/^£'P#=4) 

[SAL  <20000)  'kj{SAL  >40000) 

PAL  =[30000,  5000] 

A  conjunctive  condition  is  a  conjunction  of  attribute  conditions: 
AiHAsH  ■  ■  ■  D^m  where  all  the  A^’s  are  attribute  conditions  on  distinct  attri¬ 


butes. 
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Example: 

(SAL  >30000)  r\(DEP§  =3) 

A  disjunctive  condition  is  a  disjunction  of  attribute  conditions: 
■^iU-^2U  ■  '  ■  U-^m  where  all  the  A/s  are  attribute  conditions  on  distinct  attri¬ 
butes. 

Example: 

(A  >  30)  tJ  [{DEPjj:  =  2)  (j  (DEPjl  =  3)  ] 

A  simple  numeric  condition  is  of  the  form:  A  op  B  rel  C  where  op  is  *  or 
/,  and  rel  is  >,<,=, or  A  and  B  are  distinct  attribute  names  that  have  numeric 
domains  such  that  the  above  condition  is  meaningful.  C  is  a  constant.  The 
numeric  condition  A  rel  B,  which  is  a  special  case  of  the  simple  numeric  condi¬ 
tion,  is  CEilled  selection  in  relational  termAnolog3n  Arbitrary  arithmetic  expres¬ 
sions  are  not  considered  here  because  they  are  less  often  encountered  in  data 
base  environments,  and  also  because  it  is  difficult  to  optimize  them. 

Examples: 

QUANTlTY-IN  >  QUANTITY-OUT 
COST-PRICE  =  SALE-  PRICE 
UNIT-PRICE  *  QUANTm^-SOLD  >  1000 
YEAR-BORN  -  YEAR-EMPLOYED  >  25 

An  expression  is  in  disjunctive  normal  form  if  it  is  a  disjunction  of  conjuc- 
tive  conditions  and  simple  numeric  conditions. 

Examples: 

[(A CE*  >  18)  n  {SEX=FEMALE )  ]  L,?  >  20) 

(qUANTITY-IN  -  qUANTITY-OUT  >  1000)'^{UNIT -PRICE  >  100) 

An  expression  is  in  conjunctive  nonmal  form  if  it  is  a  conjunction  of  disjunc¬ 
tive  conditions  and  simple  numeric  conditions. 

Examples: 
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{SEX -MALE )  n  [(>1 >45)  ^  {DEGREE  =PHD )  j 
{SALARY -TAXES  >  15000)  {SEX^FEMALE ) 

The  device  structure 

The  data  base  is  assumed  to  reside  in  secondary  storage.  We  assume  that 
secondary  storage  consists  of  direct  access  storage  devices.  The  physical 
storage  is  divided  into  fixed-size  blocks  which  are  the  units  of  allocation  and 
transfer  between  main  storage  and  secondary  storage.  We  assume  that  the  tim*e 
to  read  or  write  a  block  of  secondary  memory  is  constant.  In  some  computer 
systems  the  data  base  (or  its  users)  may  reside  in  more  than  one  computer  site. 
In  this  case,  in  order  to  evaluate  a  query  it  may  be  necessary  to  move  data  from 
one  site  to  another.  Such  an  environment,  where  data  has  to  be  transmited 
from  one  site  to  another  through  telecommunication  lines,  is  called  a  distri¬ 
buted  data  base  environment.  In  a  distributed  environment  the  block  access 
times  in  the  various  sites  may  be  different. 

The  storage  structures 

We  assume  that  tuples  are  stored  as  a  unit  to  form  stored  records.  A  single 
stored  relation  will  be  referred  to  as  a  file.  The  number  of  records  in  the  file  will 
be  called  the  size  of  the  file.  We  will  assume  that  with  each  record  there  is 
stored  a  record  identifier  (RID)  that  uniquely  identifies  the  record.  While  a  rela¬ 
tion  does  not  imply  any  order  for  its  tuples,  the  records  of  a  file  have  a  fixed 
order.  The  records  of  the  file  are  divided  according  to  their  order  in  the  file  into 
logical  groups  called  pages.  Each  data  page  contains  a  fixed  number  of  records. 
A  file  is  clustered  on  an  attribute  A  if  its  pages  can  be  ordered  such  that  the 
attribute  values  of  A  appear  in  ascending  order  in  the  file.  Each  page  of  the  file 
is  mapped  into  a  different  block  of  the  secondary  storage  device.  Consecutive 
pages  of  the  file  may  be  mapped  to  non-consecutive  blocks  of  secondary  storage. 
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The  blocks  of  secondary  storage  that  are  used  for  storing  the  records  of  the 
data  base  are  called  data  blocks.  A  sequential  scan  of  the  file  requires  that  all 
the  data  blocks  containing  records  of  the  file  be  retrieved  from  secondary 
storage  in  the  order  that  the  pages  appear  in  the  file.  This  is  an  expensive 
operation  in  terms  of  block  accesses. 

Access  methods 

In  order  to  reduce  the  cost  of  data  access,  various  organization  methods 
may  be  employed.  Clustering  the  records  of  the  file  on  the  values  of  an  attri¬ 
bute  is  an  organization  method.  Other  organization  methods  are  indexing  and 
linking.  An  index  on  an  attribute  A  of  a  file  may  be  used  for  facilitating  the 
retrieval  of  the  records  of  the  file  that  obey  a  condition  on  the  values  of  the 
attribute  A.  We  assume  that  an  index  is  stored  as  a  balanced  multi-way  tree 
[Knuth  1973].  Its  leaf  nodes  contain  (attribute  value,  RID)  pairs  for  every  record 
in  the  file.  The  set  of  pairs  (attribute  value,  RID)  of  the  leaf  nodes  of  the  index 
for  a  specific  attribute  value  a  of  the  attribute  A  is  called  the  inverted  list  of  A 
for  the  value  a.  Not  every  attribute  necessarily  has  an  index.  If  an  attribute  has 
an  index  we  say  that  the  attribute  is  indexed  or  inverted.  If  an  index  exists  on  a 
clustering  attribute,  we  will  call  it  a  chj.stering  index.  File  organizations  in  which 
the  records  that  qualify  in  a  query  can  be  retrieved  by  either  sequentially  scan¬ 
ning  the  file  or  by  using  some  of  the  indices  for  facilitating  the  retrieval  are 
called  (partially)  inverted  file  organizations. 

The  blocks  that  are  used  for  storing  the  nodes  of  an  index  are  called  index 
blocks,  and  contain  a  fixed  number  of  (attribute  value,  RID)  pairs.  We  assume 
that  a  record  identifier  allo^vs  direct  access  to  a  record  in  the  sense  that  if  the 
RID  of  a  record  is  known,  only  one  block  of  the  secondary  memory  need  be 
retrieved  to  access  the  record.  If  an  index  on  an  attribute  A  exists,  to  find  the 
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RID’s  of  the  records  of  the  file  with  a  specific  attribute  A  value  only  one  node 
from  each  higher  level  of  the  index  tree  has  to  be  accessed.  We  also  assume 
that  the  pairs  (attribute  value,  KID)  of  the  leaf  nodes  are  ordered  such  that  no 
file  block  has  to  be  retrieved  more  than  once  when  the  records  of  a  specific 
attribute  value  are  retrieved.  Sometimes  it  is  useful  to  obtain  the  pairs  (attri¬ 
bute  value,  RID)  for  all  the  values  of  an  attribute  A.  We  assume  that  it  is  possible 
to  traverse  only  the  leaf  blocks  of  the  index  to  obtain  all  the  pairs  (attribute 
value,  RID)  ordered  on  attribute  value  and  RID.  An  index  may  also  be  created  to 
provide  access  to  the  records  of  the  file  that  have  attribute  values  in  a  subset  of 
the  attribute  domain  (instead  of  single  values).  For  example,  an  index  on  salary 
could  divide  the  domain  of  all  salaries  into  subdomains  (20000,25000), 
(25000,30000),  etc.. 

Another  useful  organization  method  is  linking.  Linking  may  be  used  to  con¬ 
nect  records  of  different  files  that  are  logically  related  [Tsichritzis  1975], 
[Haerder  1978].  Linking  can  be  implemented  in  various  ways.  We  will  assume 
that  the  set  of  RID’s  of  the  records  linked  to  a  specific  record  are  stored  with 
the  record  itself. 

To  retrieve  the  data  qualifying  in  a  query,  various  steps  may  be  required. 
Some  of  these  steps  may  create  organization  methods  that  will  be  used  in  subse¬ 
quent  steps.  Such  steps  may  be  sorting  a  file  or  creating  an  index  on  an  attri¬ 
bute  (f  Gotlieb  1975],  [Haerder  1977],  [Blasgen  and  Eswaran  1977]).  Other  steps 
may  involve  retrieving  indices  from  the  secondary  storage,  comparing  RID’s, 
retrieving  a  set  of  records  indicated  by  a  set  of  RID’s,  or  sequentially  scanning 
the  file.  A  sequence  of  steps  that  may  be  followed  in  order  to  evaluate  a  query  is 
called  an  access  method  or  access  strategy.  More  than  one  access  strategy  may 
be  possible  for  the  evaluation  of  the  same  query.  The  data  base  management 
system  should  select  the  access  strategy  with  the  least  cost  involved,  where  cost 
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Chapter  3:  Data  Base  Performance  Problems 


In  this  Chapter,  we  present  three  data  base  performance  evaluation  prob¬ 
lems:  the  selection  of  access  strategies  in  a  relational  environment,  the  index 
selection  problem  in  inverted  file  organizations,  and  the  design  of  data  base  sys¬ 
tem  performance  predictors.  We  review  the  re.search  done  previously  in  this 
area  and  point  out  its  limitations.  The  research  presented  in  this  thesis  is 
important  for  understanding  and  providing  satisfactory  solutions  for  these  prob¬ 
lems.  In  Chapter  8,  we  show  how  our  results  are  applied  to  these  problems.  The 
results  of  this  thesis,  however,  can  also  be  useful  in  a  variety  of  other  data  base 
performance  evaluation  studies. 


3.1  Selection  of  Access  Strategies 

Query  optimizers 

A  query  optimizer  is  a  program  that  selects  the  most  efficient  possible 
access  strategy  to  follow  for  the  evaluation  of  a  query.  Relational  queries  are 
suitable  for  optimization  since  the  languages  are  often  high  level  and  do  not 
specify  implementation  aspects.  Since  the  user  of  the  relational  data  base 
management  system  is  not  involved  in  the  choice  of  an  access  strategy,  the 
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query  optimizer  is  a  very  important  part  of  the  system.  For  large  relational 
data  bases,  the  cost  of  optimization  is  insignificant  in  comparison  to  the  benefits 
of  choosing  a  good  access  strategy.  When  a  query  is  executed  repeatedly,  the 
corresponding  object  program  can  be  stored  and  executed  repeatedly.  In  this 
case  the  optimization  has  to  be  done  only  once  (at  compile  time),  while  the 
benefits  of  the  optimization  are  seen  whenever  the  program  is  executed.  Thus, 
query  optimizers  in  relational  environments  can  reduce  considerably  the  cost  of 
query  evaluation,  while  having  a  comparatively  low  overhead.  (However, 
changes  to  the  data  base  contents  or  structure  may  alter  the  optimal  access 
strategy,  and  thus  periodic  recompilation  may  be  necessary). 


Single  file  environments 

The  optimization  of  query  evaluation  algorithms  in  single  file  environments 
has  received  the  attention  of  many  researchers.  Assuming  that  every  attribute 
of  the  file  is  inverted,  the  query  processor  can  retrieve  the  appropriate  inverted 
lists  and  perform  the  set  operations  described  in  the  query.  Thereafter,  it  can 
retrieve  the  qualifying  set  of  records.  Shapiro  [Shapiro  1970]  states  that  the 
time  required  for  parallel  processing  of  a  set  of  inverted  lists  is  proportional  to 
the  sum  of  the  lengths  of  the  inverted  lists,  if  the  lists  are  sorted.  He  considers, 
however,  only  epu  costs. 

For  the  case  that  only  two  buffers  are  available  in  the  main  memory  and 
every  domain  is  inverted,  Liu  [Liu  1976]  provides  a  set  of  algorithms  to  deter¬ 
mine  the  optimum  evaluation  trees  in  terms  of  the  number  of  comparisons 
required.  A  query  evaluation  tree  imposes  an  order  according  to  which  the  vari¬ 
ous  set  operations  specified  in  a  query  are  to  be  executed.  The  optimality  of  the 
evaluation  trees  is  based  on  properties  of  Huffman  trees  [Knuth  1973].  Strong 
simplifications  are  made,  however,  in  estimating  the  length  of  lists  of  qualifying 
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records  in  conjunctions  and  disjunctions. 

Fariey  and  Schuster  [Farley  and  Schuster  1975],  describe  an  algorithm  to 
evaluate  a  query  by  traversing  a  query  evaluation  tree,  choosing  its  cheaper 
branches  first.  The  cost  assigned  to  a  subtree  is  a  gross  estimate  of  the  io 
required  for  the  processing  of  the  subtree.  Branches  with  many  inverted  lists 
are  always  traversed  first.  The  algorithm  assumes  that  all  the  relevant  inverted 
lists  (or  at  least  one  block  from  each)  are  in  main  memory  at  the  same  time,  so 
that  they  can  be  processed  in  parallel. 

An  analytic  model  for  evaluating  the  performance  of  single  inverted  file 
organizations  was  reported  by  Cardenas  [Cardenas  1975].  It  takes  as  input 
three  sets  of  parameters  that  describe  the  file  contents,  the  queries,  and  the 
hardware  configuration.  The  output  is  the  expected  response  time  of  the 
queries.  A  disjunctive  normal  form  for  queries  is  assumed.  It  is  also  assumed 
that  every  attribute  is  inverted.  To  evaluate  a  query  in  disjunctive  normal  form, 
all  the  conjunctive  conditions  are  processed  first  using  the  indices  of  all  the 
attributes  participating  in  the  conjunction.  This  is  not  always  the  best  strategy 
to  evaluate  a  query  in  disjunctive  normal  form,  however,  even  vfhen  all  the  attri¬ 
butes  are  inverted. 

Kerschberg  et  al.  [Kerschberg  et  al.  1980]  proposed  a  model  for  optimum 
query  evaluation  in  a  single  file  (as  well  as  in  a  multi-file)  environment.  They 
take  into  account  the  fact  that  using  ail  the  indices  to  process  queries  may  not 
always  be  the  best  strategy.  Their  model  however,  treats  only  the  class  of 
queries  that  are  in  disjunctive  normal  form.  We  will  discuss  the  assumptions  of 
this  model  in  later  Chapters. 

In  conclusion,  the  existing  models  for  evaluating  performance  in  single 
inverted  file  environments  are  generally  limited  in  the  class  of  queries  that  they 
examine  and  by  the  use  of  algorithms  that  deperid  on  simple  assumptions  about 
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data  base  contents  and  data  placement  on  devices.  We  show  in  subsequent 
Chapters  that  these  assumptions  may  not  be  satisfied  in  some  data  base 
environments,  and  discuss  implications  of  this  result  for  query  evaluation.  In 
Chapter  8  we  present  a  set  of  algorithms  for  query  evaluation  in  fiat  file  environ¬ 
ments  for  a  wider  class  of  queries  using  a  more  detailed  model  of  data  base  con¬ 
tents  and  data  placement  on  devices. 

Multirslational  environTnents 

In  relational  environments  the  join  is  a  frequent  and  expensive  operation. 
The  performance  of  a  specific  join  algorithm  depends  on  many  factors  such  as 
the  availability  of  indices  on  the  joining  attributes,  the  order  of  the  records  in 
the  files,  the  distribution  of  values  of  the  joining  domains,  and  the  size  of  the 
relations  to  be  joined.  Before  the  join  is  processed,  it  may  be  useful  to  create 
indices,  sort  the  relations,  or  form,  projections  on  the  joining  attributes.  This  is 
called  relation  preparation  [Kershberg  et  al.  1980].  The  choice  of  a  join  algo¬ 
rithm  is  made  difficult  because  of  the  many  alternatives  available  ([Gotlieb 
1975],  [Rothnie  1975],  [Blasgen  and  Eswaran  1977],  [Ozkarahan  et  al.  1977]). 

The  serial  join  (nested  loop  join  [Selinger  et  al.  1979],  join  using  tuple  sub¬ 
stitution  [Wong  and  Youssefi  1976])  examines  one  tuple  of  the  first  relation  (in 
the  order  that  they  appear  in  the  relation)  at  each  step  and  finds  all  the 
matches  in  the  second  relation.  The  serial  join  can  be  useful  for  small  file  sizes 
when  there  are  no  indices  and  no  clustering  on  the  joining  attributes. 

The  parallel  join  (merge  join  [Selinger  et  al.  1979],  parallel  join  [Yao  and  De 
Jong  1978b])  examines  at  each  step  all  the  tuples  of  the  first  relation  with  a 
specific  value  in  the  joining  attribute  (in  increasing  order  of  attribute  values) 
and  finds  ail  the  matches  in  the  second  relation.  This  algorithm  can  be  useful 
when  both  relations  are  sorted  or  have  indices  on  the  joining  attribute. 
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The  semi-join  algorithm  uses  a  projection  or  an  index  on  the  joining  attri¬ 
bute  of  one  relation  to  first  eliminate  the  records  of  the  other  relation  that  do 
not  participate  in  the  join.  The  semi-join  algorithm  can  be  useful  in  distributed 
data  base  environments  [Kerschberg  et  al.  1980]. 

When  a  series  of  relational  algebra  operations  has  to  be  executed,  the 
order  of  execution  affects  the  cost.  Since  restriction,  selection  and  projection 
operations  decrease  the  file  size,  it  may  be  reasonable  to  process  them  as  soon 
as  possible.  In  addition,  when  they  refer  to  the  same  relation,  they  could  be 
grouped  together  such  that  a  specific  record  is  examined  only  once.  Smith  and 
Chang  [Smith  and  Chang  1975],  provide  a  set  of  heuristic  query  tree  transforma¬ 
tions  based  on  the  above  ideas.  After  the  tree  transformations,  the  most 
efficient  implementation  of  the  operators  at  each  node  of  the  tree  is  determined 
such  that  operators  make  good  use  of  accessing  mechanisms  provided  by  lower 
operators  in  the  tree.  However,  delaying  a  join  as  far  as  possible  is  not  always 
the  best  choice  since  the  join  may  be  a  very  selective  operation.  Thus  good  esti¬ 
mates  of  join  selectivities  are  crucial  for  selecting  an  optimum  strategy  for 
query  processing. 

Various  other  researchers  ([Blasgen  and  Eswaran  1976],  [Yao  and  De  Jong 
1978b]),  have  analyzed  the  cost  of  processing  a  general  relational  algebra  query 
consisting  of  restrictions,  projections  and  joins,  in  an  environment  invohdng  two 
relations.  In  [Yao  and  De  Jong  1978b],  many  possible  access  strategies  are 
analyzed  in  terms  of  secondary  storage  accesses.  They  are  compared  to  one 
another  as  well  as  to  other  models  for  various  values  of  the  input  parameters.  It 
is  shown  that  the  choice  of  a  good  access  path  depends  heavily  on  the  contents 
of  the  data  base  as  described  by  the  selectivities  of  the  various  operators. 

In  multirelational  query  environments,  the  order  of  processing  joins  is 
important  to  the  performance  of  the  system.  There  are  tvv’-o  reasons  for  this. 
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The  first  is  that  it  is  desirable  to  keep  the  size  of  intermediate  relations  as  small 
as  possible  to  avoid  an  excessive  number  of  comparisons  and  block  accesses. 
The  second  is  that  it  is  desirable  to  pipeline  the  operations  as  much  as  possible 
to  avoid  storing  intermediate  results  in  secondary  storage.  An  approach  to  the 
solution  of  this  problem  is  described  by  Wong  and  Youssefi  [Wong  and  Youssefi 
1976].  The  query  processor  first  decomposes  a  general  query  into  its  irreducible 
components. Then,  the  query  processor  orders  the  irreducible  components  for 
processing.  In  one  irreducible  component,  it  first  processes  the  one  variable 
subqueries  (subqueries  that  refer  to  only  one  relation).  Then  it  recursively 
chooses  the  smallest  reduced  size  relation  for  tuple  substitution  to  derive  a  new 
set  of  smaller  size  queries  until  only  one  variable  queries  remain.  In  tuple  sub¬ 
stitution,  an  n  variable  query  is  replaced  by  a  family  of  (n-l)  variable  queries 
resulting  from  substitution  for  one  of  its  variables  tuple  by  tuple.  The  relation 
resulting  from  one  irreducible  component  is  passed  to  the  next.  However  this 
heuristic  approach  does  not  make  good  use  of  information  about  existing  organi¬ 
zation  methods  or  contents  of  the  data  base. 

In  System  R  [Astrahan  et  al.  1976],  the  optimization  of  query  processing  is 
based  on  statistics  maintained  by  the  system.  Two  methods  are  provided  for 
joining  relations;  a  nested  loop  join  and  a  merge  join.  To  find  an  optimum  plan 
for  joining  n  relations  in  System  R  [Selinger  et  al.  1979],  a  tree  of  possible  solu¬ 
tions  is  constructed  and  the  cost  of  each  branch  is  evaluated.  A  solution  con¬ 
sists  of  an  ordered  list  of  relations  to  be  joined,  the  joining  method  used  for  the 
join,  and  a  plan  indicating  how  each  relation  is  to  be  accessed.  Only  joining  ord¬ 
ers  that  allow  pipelining  of  records  through  the  joins  are  considered.  Interest¬ 
ing  orders  are  those  that  produce  tuples  ordered  according  to  the  values  of  a 

+  A  query  is  called  reducible,  if  it  can  be  replaced  by  a  sequence  of  two  subqueries  that  over¬ 
lap  in  only  one  variable.  It  has  been  experimentally  shown  that  these  components  form  an 
advantageous  sequence  of  subqueries  [Youssefi  and  Wong  1978], 
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domain  specified  in  a  join  or  in  a  GROUP  BY  or  ORDER  BY  statement.  For  a  sin¬ 
gle  relation,  the  cheapest  access  path  that  will  produce  tuples  not  in  an  interest¬ 
ing  order  is  saved,  together  with  all  the  cheapest  access  strategies  that  will  pro¬ 
duce  tuples  in  interesting  orders.  After  that,  the  tree  is  constructed,  working 
from  the  leaves  to  the  root. 

In  summary  then,  the  optimization  of  access  strategies  in  multirelational 
environments  presents  more  difficulties  because  of  the  larger  number  of 
options.  At  the  same  time,  multirelational  operations  can  be  very  expensive  and 
are  thus  worthy  of  optimization.  Although  various  heuristics,  as  well  as  simple 
analytic  models  have  already  proven  useful  for  optimization  of  queries,  more 
detailed  modelling  may  lead  to  significant  additional  savings.  Examples  of  sav¬ 
ings  that  may  occur  by  using  more  detailed  models  are  presented  in  Chapter  8. 

Distributed  environments 

The  selection  of  access  strategies  becomes  harder  in  distributed  environ¬ 
ments.  The  reason  is  that  additional  factors  affect  the  performance  of  the  algo¬ 
rithms.  These  factors  are  the  telecommunication  cost,  the  architecture  and  the 
type  of  the  network,  the  existence  of  multiple  copies,  and  the  various  access 
times  of  the  secondary  storage  devices  in  the  sites  of  the  network. 

Since  data  transfer  between  sites  is  an  expensive  operation,  some  analytic 
models  consider  the  telecommunication  cost  as  the  only  cost  of  the  system.  In 
that  case,  the  semi-join  may  be  a  convenient  way  to  form  the  join,  because  it 
aims  to  reduce  the  sizes  of  the  relations  before  it  transfers  them  between  sites. 
Goodman  et  al.  [Goodman  et  al.  1979]  describe  query  processing  in  SDD-1,  a  Sys¬ 
tem  for  Distributed  Databases.  Queries  are  submitted  to  SDD-1  in  a  high  level 
language.  The  query  optimizer  first  transforms  the  queries  in  to  a  relational  cal¬ 
culus  form  called  an  envelope.  Then  local  processing  in  each  site  is  done  to 


-  18  - 


reduce  the  number  of  records  in  the  data  base  that  must  be  moved  between 
sites  in  order  to  evaluate  the  query.  Next,  "profitable”  semi-joins  are  considered 
for  reducing  farther  the  number  of  data  moves  between  the  sites.  The  cost  and 
benefit  of  semi-joins  is  estimated  from  simple  statistics  that  the  system  keeps  in 
order  to  estimate  selectivities.  From  the  set  of  "profitable"  semi-joins,  the  least 
expensive  semi-join  is  chosen  at  each  step.  Finally,  when  no  more  "profitable" 
semi-joins  exist,  the  site  where  the  answer  to  the  query  will  be  formed  is  chosen 
and  the  selected  portions  of  the  data  base  are  moved  to  this  site  in  order  to 
evaluate  the  query.  However,  Bernstein  and  Chiu  [Bernstein  and  Chiu  1981]  have 
shown  that  semi-joins  are  not  always  useful  for  reducing  the  amount  of  data  to 
be  transferred  between  sites.  Since  SDD-1  considers  only  semi-joins  for  query 
optimization,  a  good  strategy  may  not  always  be  found. 

Kevner  and  Yao  [Hevner  and  Yao  1979]  also  consider  semi-joins  for  query 
processing  in  a  distributed  data  base  environment.  It  is  assumed  that  only  data 
transfers  between  sites  incur  a  cost  in  the  system,  and  that  the  cost  depends 
linearly  on  the  amount  of  data  transferred  (Cq+CiX,  where  Cq  is  the  cost  of 
establishing  communication  between  sites,  and  x  is  the  amount  of  data 
transferred).  Optimum  response  time  and  optimum  total  telecommunication 
cost  algorithms  for  a  single  job  in  the  system  are  presented  (algorithm  C)  for 
the  case  where  relations  are  constructed  from  a  single  domain,  and  the  ordering 
of  relations  according  to  their  selectivities  is  the  same  as  the  ordering  of  rela¬ 
tions  according  to  their  size  in  bytes.  A  more  general  algorithm  (algorithm  G) 
presented  in  the  paper  performs  an  exhaustive  search  for  profitable  semi-joins. 
It  is  shown  by  an  example  that  the  algorithm  may  lead  to  large  reductions  in  the 
amount  of  data  that  must  be  transferred  between  sites  in  order  to  evaluate  the 
query. 


The  computing  cost  is  also  important  for  the  system  performance.  Query 
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processing  in  a  distributed  environment  with  minimum  telecommunication  cost 
or  minimum  response  time  criteria  is  examined  by  Epstein  et  al.  [Epstein  et  al. 
1978].  The  system  cost  is  estimated  as  the  weighted  sum  of  the  telecommunica¬ 
tion  cost  and  the  computing  cost.  For  minimum  telecommunication  cost,  the 
minimum  amount  of  data  should  be  transferred  between  sites,  while  for 
minimum  response  time  it  may  be  desirable  to  equalize  the  fragments  of  a  rela¬ 
tion  in  the  processing  sites.  The  algorithm  for  query  processing  presented  in 
the  paper  processes  in  each  site  all  the  one  variable  subqueries.  Next,  it  applies 
the  reduction  algorithm  to  find  the  irreducible  components  of  the  query.  Finally 
it  selects  the  sites  where  the  components  will  be  processed,  and  moves  the 
required  pieces  of  the  relations  to  those  sites.  The  choice  of  the  sites  depends 
on  the  optimization  criteria  as  well  as  on  the  kind  of  the  network  (site  to  site  or 
broadcast).  The  paper  presents  some  heuristic  rules  for  the  choice  of  the  pro¬ 
cessing  strategy. 

Kerschberg  et  al.  [Kerschberg  et  al.  1980],  examine  optimal  query  process¬ 
ing  in  a  star  network  configuration.  The  network  consists  of  a  central  computer 
and  at  least  one  local  computer  connected  to  the  central  site  via  communication 
lines.  The  objective  of  the  optimization  is  to  minimize  query  response  time. 
Various  methods  of  processing  the  join  are  examined  and  simulation  results  are 
presented  for  various  parameter  values.  The  results  show  that  high  telecom¬ 
munication  speeds  favor  the  join  formation  in  the  faster  central  site  and  that 
the  size  of  the  resulting  relation  is  an  important  factor  for  determining  the  site 
of  the  join  formation.  It  is  shown  that  semi-joins  become  important  for  large 
files,  small  join  selectivities,  and  low  telecommunication  speeds. 

Semi-joins  have  been  used  in  distributed  data  base  environments  in  order 
to  minimize  the  cost  of  data  transfers  between  sites.  However  semi-joins  are  not 
always  profitable.  Analytic  models  that  help  the  optimizer  choose  between 
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semi-joins  and  other  ways  of  forming  the  join  by  taking  into  account  telecom¬ 
munication  costs,  processing  costs,  and  queueing  costs  should  be  considered. 


3.2  Index  Selection 

Maintaining  one  index  for  every  attribute  in  a  file  may  be  costly  since  it 
involves  high  space  utilization  and  maintenance  costs.  An  alternative  approach 
is  to  maintain  indices  for  selected  attributes  only.  However  the  choice  of  an 
optimum  set  of  indices  is  not  an  easy  one.  Thus,  analytic  models  have  been  used 
for  the  automatic  selection  of  indices.  The  cost  function  of  such  models  has  to 
account  for  the  space  and  maintenance  costs  in  the  presence  of  an  index,  as 
well  as  for  the  higher  retrieval  cost  in  the  absence  of  an  index.  A  good  index  is 
one  that  reduces  the  overall  system  cost  by  its  presence. 

Early  models  ([Lum  and  Ling  1971],  [King  1974],  [Yue  and  Wong  1975]),  pro¬ 
vided  solutions  to  the  problem  only  for  single  attribute  queries.  However, 
indices  are  mostly  useful  in  a  multi-attribute  retrieval  environment. 

The  index  selection  problem  in  a  multi-attribute  retrieval  environment  was 
first  examined  by  Stoncbraker  [Stonebraker  1974],  and  Schkolnick  [Schkolniek 
1975].  Their  models  allow  only  conjunction  of  attribute  values  in  queries  and 
assume  a  specific  query  evaluation  procedure.  The  probability  of  an  attribute 
appearing  in  a  query  is  assumed  to  be  given,  and  it  is  independent  of  the  pres¬ 
ence  of  other  attributes  in  the  same  query.  However,  that  is  unrealistic,  since 
certain  attributes  tend  to  appear  together  in  queries  and  others  never  appear 
together  in  the  same  query. 

Another  approach  ([Anderson  and  Berra  1977],  [Hammer  and  Chan  1976]), 
places  queries  into  classes  depending  on  which  attributes  appear  in  the  query. 
For  each  class,  the  frequency  of  the  class  is  estimated  by  the  queries  run  by  the 
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system  in  a  particular  time  interval.  The  cost  of  the  class  is  then  estimated 
under  a  given  set  of  indices  and  is  added  to  the  total  cost.  Thus,  this  approach 
considers  some  information  on  the  correlation  of  attributes  in  queries.  Hammer 
and  Chan  suggest  that  further  query  correlation  information  may  be  gained  if 
certain  very  frequent  queries  are  examined  as  individual  classes,  while  less  fre¬ 
quent  queries  are  grouped  into  other  classes. 

Given  that  the  cost  function  is  known,  an  approach  to  index  selection  is  to 
evaluate  the  total  system  cost  for  each  possible  index  set  and  then  select  the 
set  of  domains  that  gives  the  smallest  cost.  However  for  m  domains  in  the  rela¬ 
tion,  there  are  2'^  possible  choices  of  index  sets.  For  moderate  values  of  m  this 
number  can  be  very  large,  therefore,  possible  ways  to  reduce  the  search  space 
should  be  investigated.  A  method  to  do  that  [Chan  and  Hammer  1976]  is  to  look 
first  for  all  the  single  indices  which  will  decrement  the  total  system  cost,  and 
when  no  other  single  index  can  reduce  the  cost  of  the  system,  to  look  for  possi¬ 
ble  combinations  of  indices.  One  reason  for  looking  for  the  combinations  is  that 
attributes  may  participate  in  disjunctive  queries. 

The  index  selection  problem  in  single  inverted  files  has  been  given  consider¬ 
able  attention  by  various  researchers.  Solutions  however  for  the.  problem  in 
clustered  files  or  multifile  environments  have  not  been  given.  Moreover,  the 
effect  of  the  various  assumptions  of  the  models  on  the  selection  of  indices  has 
not  yet  been  examined.  In  Chapter  5  we  present  some  implications  of  the  distri¬ 
bution  of  the  attribute  values  for  the  index  selection  problem.  Examples  of  pos¬ 
sible  performance  improvement  due  to  the  selection  of  a  better  set  of  indices 
are  given  in  Chapter  8. 
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3.3  Physical  Data  Base  Design  Performance  Predictors 

Physical  data  base  design  performance  predictors  are  programs  that  take 
as  input  a  specific  logical  data  base  design,  information  about  the  queries,  and 
software  and  hardware  characterestics  of  the  system,  and  produce  as  output  an 
estimation  of  the  performance  of  the  data  base  under  a  physical  data  base 
design.  If  the  user  is  not  satisfied  with  the  predicted  performance,  he  can  try 
various  changes  in  the  physical  design  (like  providing  more  indices,  using  alter¬ 
native  clustering,  or  extending  the  buffer  size).  The  performance  predictor  will 
predict  the  performance  of  the  system  under  the  new  design,  and  the  user  can 
iterate  until  he  is  satisfied. 

The  use  of  data  base  design  performance  predictors  is  justified  by  the  fact 
that  the  performance  of  a  data  base  depends  on  many  interdependent  factors. 
Therefore  the  cost  equations  are  difficult  to  optimize.  Moreover,  the  develop¬ 
ment  of  good  general  heuristics  that  work  well  in  every  case  is  not  easy.  The 
problem  becomes  even  more  difficult  when  response  time  is  the  performance 
criterion  because  precise  estimates  of  the  response  time  arc  required.  Thus, 
the  design  of  good  data  base  performance  predictors  is  an  interesting  and 
difficult  problem. 

Nakamura  et  al.  [Nakamura  et  al.  1976]  report  on  a  data  base  performance 
predictor  based  on  a  simulation  model.  The  system  predicts  the  response  time 
of  queries  in  hierarchical  data  structures.  The  advantage  of  a  simulation  model 
is  that  it  can  incorporate  complex  effects  like  concurrent  accesses,  channel 
contention,  operating  system  overheads,  and  buffer  management.  However  it 
provides  greater  accuracy  at  the  price  of  much  higher  computation  cost.  This 
can  be  crucial  if  a  large  number  of  alternative  designs  is  to  be  examined.  More¬ 
over  the  input  for  simulation  models  is  typically  required  to  be  more  detailed. 
These  detailed  values  may  not  be  available  during  the  data  base  design  phase. 
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Teorey  and  Oberlander  [Teorey  and  Oberlander  1973]  present  a  hybrid  data 
base  design  evaluator  (DBDE).  The  DBDE  is  an  extension  of  an  analytic  model 
called  File  Design  Analyzer  [Teorey  and  Das  1976]  that  handles  the  effect  of 
buffering  and  overflow  of  records  due  to  the  data  base  growth.  The  design 
evaluator  also  finds  bounds  on  the  io  time  due  to  sharing  of  devices,  as  well  as 
response  time  delays  due  to  multiprogramming  using  a  queueing  network 
model.  The  model  is  designed  to  analyze  the  performance  in  a  specific  data 
base  management  system  (IDS). 

Gambino  and  Gerritsen  [Gambino  and  Gerritsen  1977]  developed  a  data 
base  design  support  system  to  aid  in  the  design  of  a  data  base  based  on  the 
CODASYL  model.  The  system  is  based  on  an  analytic  performance  evaluation 
model.  The  data  base  designer  inputs  a  description  of  the  conceptual  schema  as 
well  as  information  about  the  access  strategies  followed  when  various  queries 
are  submitted  to  the  system.  The  designer  is  able  to  select  for  each  record  type 
the  location  mode,  the  clustering  attribute,  and  the  desired  set  implementation 
(pointer  array,  chains  with  next,  etc.).  Questions  involving  indexing  and  area 
placement  wil]  be  considered  in  the  future.  A  nice  feature  of  the  system  is  its 
capability  of  storing  a  solution  upon  request.  This  allows  the  designer  to  explore 
other  alternatives,  and  if  not  satisfied,  to  restore  the  previous  solution.  The  sys¬ 
tem  does  not  help  the  designer  in  the  search  of  a  good  solution,  but  is  a  useful 
tool  to  provide  a  check  to  see  if  ail  constraints  are  satisfied,  and  to  perform  an 
evaluation  of  the  current  solution. 

Pezarro  [Pezarro  1980]  presents  a  data  base  performance  predictor  based 
on  a  DBMS  independent  analytic  model.  The  paper  stresses  the  need  for  fast  but 
sufficiently  accurate  physical  data  base  performance  estimation,  such  that  it 
allows  ranking  of  physical  designs.  Pezarro  presents  experimental  results  on  a 
large  data  base  environment  when  .simple  requests  are  given  to  the  data  base. 
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Sevcik  [Sevcik  1981]  presents  an  analytic  hierarchical  model  for  data  base 
system  performance  prediction.  In  the  lowest  level  the  model  uses  queueing 
network  models  to  derive  device  utilizations  and  response  times.  In  higher  lev¬ 
els  the  workload  is  described  with  parameters  representing  various  degrees  of 
detail.  By  analytic  techniques  the  workload  of  one  level  can  be  transformed  into 
the  workload  of  a  lower,  more  fully  specified  level.  The  objective  is  that  many 
data  base  design  alternatives  can  be  represented  by  changes  at  a  single  level  of 
the  layered  model. 

The  development  of  data  base  performance  predictors  is  required  for  help¬ 
ing  with  the  design  and  tuning  of  large  expensive  systems  with  a  large  number  of 
parameters  and  complex  requirements.  However,  in  order  to  choose  between 
alternative  physical  data  base  designs  or  to  satisfy  responce  time  requirements, 
the  cost  predictions  should  be  quite  precise.  That,  in  turn,  requires  detailed 
modelling  of  the  factors  that  affect  the  system  performance.  In  Chapters  4  and 
5  we  show  that  the  assumptions  used  for  modelling  data  base  contents,  data 
placement  on  devices  and  user  requests  may  lead  to  large  errors.  In  Chapters  6 
and  7  we  present  more  detailed  modelling  techniques. 
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Chapter  4:  Selectivities 


Selectivities  are  quantities  used  by  analytic  models  in  order  to  estimate  the 
cpu  and  io  activity  in  a  variety  of  data  base  performance  evaluation  problems 
such  as  the  ones  described  earlier.  In  this  chapter  we  describe  various  kinds  of 
selectivities,  their  use  in  system  performance  evaluation,  and  how  earlier 
models  estimate  the  selectivities.  Next,  we  discuss  the  assumptions  used  by 
these  models  for  the  estimation  of  selectivities.  We  show  that  these  assumptions 
are  very  restrictive,  and  we  give  realistic  examples  of  common,  large  data  bases 
where  these  assumptions  are  not  satisfied. 

4.1  Kinds  and  Use  of  Selectivities 

Various  kinds  of  selectivities  are  used  in  the  system  cost  equations  of  previ¬ 
ous  models:  restriction  selectivities  refer  to  the  proportion  of  records  qualifying 
in  a  restriction:  selection  selectivities  refer  to  the  proportion  of  records  qualify¬ 
ing  in  a  selection:  join  selectivities  refer  to  the  proportion  of  the  records  of  a  file 
that  are  joined  vnth  the  records  of  another  file.  The  selectivities  multiplied  by 
the  number  of  records  of  a  file,  give  the  total  number  of  records  of  the  file  that 
qualify  in  a  step  of  a  query  evaluation.  Thus,  they  provide  a  measure  of  the  sys¬ 
tem  cost. 

In  the  case  that  joins  of  more  than  two  relations  are  formed,  the  number  of 
tuples  in  a  join  (size  of  the  join)  is  also  important  for  the  system  cost.  It  is 
desirable  to  keep  the  sizes  of  the  intermediate  relations  small.  Moreover,  the 
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size  of  the  join  is  also  very  important  for  choosing  the  site  of  the  join  formation 
in  distributed  systems.  The  computers  of  different  sites  may  have  storage  dev¬ 
ices  with  different  speeds.  In  this  case,  the  join  may  be  formed  in  a  remote  site 
with  fast  io  devices  and  then  moved  to  the  local  site.  If  a  projection  has  to  be 
performed  before  a  join,  the  number  of  tuples  that  will  participate  in  the  join 
may  be  further  reduced.  Selectivity  of  a  projection  is  the  number  of  tuples 
remaining  after  a  projection. 

We  will  use  the  general  term  record  selectivity  in  order  to  denote  the 
number  of  records  that  the  system  is  dealing  with  in  a  step  of  the  query  evalua¬ 
tion.  In  order  to  estimate  the  record  selectivity  in  a  step  of  the  query  evalua¬ 
tion,  Ql  model  of  data  base  contents  is  necessary. 

We  will  call  block  selectivity  the  number  of  blocks  of  the  file  that  contain 
records  that  qualify  in  a  step  of  a  query  evaluation.  We  will  call  projectivity  the 
proportion  of  a  file  in  bytes  that  remmins  after  a  projection  on  a  set  of  attributes 
of  the  file.  Block  selectivities  and  projcctivities  are  used  to  measure  an  impor¬ 
tant  cost  of  the  system,  the  io  activity.  In  order  to  estimate  the  number  of 
blocks  that  contain  a  set  of  records  that  need  to  be  retrieved  from  secondary 
storage  in  a  step  of  query  evaluation,  we  need  to  know  where  this  set  of  records 
is  located  in  the  storage  among  the  other  records.  This  requires  a  model  of 
record  placement  in  a  file. 

Average  record  selectivity  is  the  average  number  of  records  in  a  file  that 
qualify  in  a  set  of  queries  invohdng  certain  attributes  of  the  file,  while  average- 
block  selectivity  is  the  average  number  of  blocks  of  the  file  containing  those 
records.  The  average  selectivities  are  used  in  evaluating  various  organization 
methods  such  as  indexing.  In  order  to  estimate  the  average  selectivities,  it  is 
important  to  describe  the  distribution  of  queries  in  a  file.  This  requires  model 
of  query  distribution  in  a  file. 
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4.2  Estimation  of  Selectivities  in  Previous  Models 

The  estimation  of  selectivities,  as  was  mentioned  previously,  is  based  on 
models  describing  the  data  base  contents,  the  record  placement  in  the  files,  and 
the  query  distribution  in  the  files.  The  existing  models  make  various  assump¬ 
tions  about  the  data  base  contents,  the  record  placement  in  the  files  and  the 
query  distribution  in  the  files.  Based  on  those  assumptions  and  statistics  kept 
by  the  system,  they  estimate  the  selectivities  and  thus  the  system  cost. 

The  uniformity  assumption  has  been  used  to  model  the  distribution  of 
values  of  a  single  attribute  in  its  domain.  Each  attribute  is  assumed  to  have  a 
domain  of  m  distinct  values.  The  number  of  records  in  a  file  that  have  a  value  v 
in  the  attribute  A  is  calculated  as  N /m,  where  N  is  the  number  of  records  in  the 
file.  Using  the  uniformity  assumption,  the  record  selectivities  of  single  attri¬ 
bute  conditions  can  be  approximated. 

The  independence  assumption  has  been  used  to  estimate  record  selectivi¬ 
ties  in  queries  where  more  than  one  attribute  is  involved.  According  to  this 
assumption  the  attribute  values  of  any  two  attributes  A1  and  A2  are  indepen¬ 
dent.  Thus,  the  number  of  records  qualifying  in  a  conjunctive  condition  on  two 
attributes  A1  and  A2  is  estimated  as  N*s  ^*S2  where  Si  and  S2  are  the  proportions 
of  the  file  qualifying  in  the  single  attribute  conditions  on  A1  and  A2  respectively. 

The  estimation  of  the  record  selectivities  in  many  performance  studies  is 
based  on  a  uniformity  and  independence  assumption  model  of  data  base  con¬ 
tents  ([Cardenas  1976],  [Yao  and  De  Jong  1978b],  [Selinger  et  al.  1979]).  The 
selection  selectivity  in  an  attribute  equality  condition  is  estimated  by  Selinger 
et  al.  [Selinger  et  al.  1979],  as  N /Max{mx,m2),  where  mi  and  mg  are  the 
number  of  distinct  values  of  the  two  attributes.  The  join  selectivity  in  an  equi- 
join  is  estimated  by  Yao  and  De  Jong  [Yao  and  De  Jong  1978b],  as  N*S2,  where  N 
is  the  size  of  the  file  and  ss  the  proportion  of  distinct  values  in  the  joining 
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domain  of  the  other  file.  The  size  of  the  join  in  System  R  is  estimated  as 
N i*N2*s  i*S2,  where  A^i  and  No  are  the  sizes  of  the  two  relations,  and  Sj  and  S2 
are  the  join  selectivities.  In  System  R,  the  estimation  rules  imply  that  all  the 
distinct  values  of  the  joining  domain  remain  after  the  restriction  on  another 
attribute.  Yao  and  De  Jong  transform  the  problem  of  finding  the  number  of 
remaining  values  to  a  problem  of  random  placement  from  a  uniform  distribu¬ 
tion.  For  small  numbers  of  distinct  values  in  the  joining  attribute  domain,  this 
estimation  practically  implies  that  all  the  distinct  values  of  the  joining  domain 
will  remain  after  the  restriction. 

The  estimation  of  block  selectivities  in  the  various  existing  analytic  models 
is  based  on  the  random  placement  assumption.  The  records  to  be  retrieved  are 
assumed  to  be  randomly  placed  among  the  blocks  of  a  file.  It  is  implied  that 
each  record  in  a  file  has  the  same  probability  of  being  selected.  In  many  cases 
files  are  clustered  according  to  the  attribute  values  of  an  attribute.  To  our 
knowledge,  models  estimating  block  accesses  in  clustered  files  have  not 
appeared  in  the  literature. 

The  estimation  of  average  selectivities  necessitates  the  modelling  of  the 
queries  to  the  data  base.  Early  models  assumed  that  the  occurrence  of  attri¬ 
butes  in  queries  is  independent  of  the  presence  of  other  attributes  in  the  query. 
We  will  call  this  assumption  independence  of  attributes  in  queries  assumption. 
However,  in  most  environments,  certain  groups  of  attributes  tend  to  appear 
together  in  queries.  This  aflects  the  choice  of  a  good  file  organization  method. 
The  same  models  assume  that  all  the  queries  that  refer  to  a  specific  attribute 
are  uniformly  distributed  in  all  the  values  of  the  attribute.  We  will  call  this  the 
uniformity  of  queries  assumption. 

Recently,  more  detailed  approaches  have  been  reported  in  the  literature. 
Kershberg  et  al.  [Kershberg  et  al.  1980]  separated  the  attributes  into  two 
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classes:  1)  attributes  that  have  values  following  basically  a  uniform  distribution 
with  a  small  number  of  outliers  (attribute  values  with  large  number  of  records 
involved),  and  2)  attributes  that  have  attribute  values  following  exponential  or 
Zipf-like  distributions  [Knuth  1973]  when  sorted  on  frequency  of  occurrence.  In 
the  first  case  the  outliers  are  stored  and  the  other  values  are  approximated  with 
a  uniform  distribution.  In  the  second  case  the  values  which  occur  more  fre¬ 
quently  are  stored  and  the  remaining  values  are  approximated  by  uniform  dis¬ 
tributions.  Demolombe  [Demolombe  1980]  presented  an  approach  based  on  uni¬ 
formity  and  independence  assumptions  with  many  additional  parameters  in 
order  to  reduce  the  effect  of  attribute  value  dependencies  (some  of  these  are 
the  size  of  all  possible  projections,  the  size  of  equality  selections,  and  interela- 
tional  dependencies).  Richard  [Richard  1981]  showed  how  to  estimate  the  cost 
of  a  system  using  a  model  based  on  the  uniformity  and  independence  assump¬ 
tions  with  the  additional  information  of  sizes  of  all  possible  projections.  He 
reported  that  some  evaluation  of  the  method  did  not  show  clearly  under  which 
assumptions  the  method  worked.  All  the  above  methods  assume  that  the  quali¬ 
fying  records  are  randomly  placed  in  a  file. 

Recent  approaches  in  estimating  average  selectivities  categorize  the 
queries  in  groups  involving  certain  attributes  appearing  together  ([Hammer  and 
Chan  1976],  [Anderson  and  Berra  1977]).  Those  models  improve  accuracy  over 
the  independence  of  attributes  in  queries  assumption.  However,  in  all  the  exist¬ 
ing  models,  the  selectivity  of  each  attribute  in  a  file  is  estimated  independent  of 
the  specific  query  category.  In  [Anderson  and  Berra  1977],  it  is  estimated  using 
the  uniformity  of  queries  assumption.  In  [Hammer  and  Chan  1976],  the  selec¬ 
tivity  of  an  attribute  in  a  query  is  estimated  as  the  fraction  of  the  records  that 
remain  after  the  evaluation  of  an  attribute  condition  involving  the  attribute. 
The  average  record  selectivities  are  estimated  as  the  average  of  the  selectivities 
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of  the  attribute  over  all  the  queries  in  -which  the  attribute  participated. 


4.3  Evidence  that  the  Assumptions  are  not  Satisfied 

We  will  demosLrate  in  Lliis  section  that  the  assumptions  of  uniformity  and 
independence  of  attribute  values,  random  placement  of  records  in  the  file,  and 
uniformity  of  queries  are  not  satisfied  in  many  data  base  environments. 

UnifoTTnity  assumption 

The  uniformity  assumption  may  be  unrealistic  for  describing  the  distribu¬ 
tion  of  attribute  values  of  an  attribute.  Data  bases  often  contain  information 
describing  populations  of  the  real  world.  Examples  of  such  populations  are  the 
employees  of  an  organization,  the  students  of  a  university,  the  people  under 
security  surveillance,  the  fish  of  an  aquarium,  or  the  cars  of  an  autoshop.  Such 
populations  usually  have  only  a  few  members  with  extreme  characteristics.  For 
example,  only  a  few  people  are  extremely  tall  or  extremely  short.  The  majority 
have  a  height  near  to  an  average  height.  Thus  the  values  of  some  attributes  of 
data  bases  describing  such  populations  tend  to  have  unimodal  distributions 
rather  than  uniform. 

As  an  example  of  a  population,  consider  the  population  of  professional 
engineers  of  Ontario.  A  recent  report  [Engineering  Report  1979]  provides  vari¬ 
ous  statistics  such  as  years  of  experience,  responsibility  level,  and  salary  infor¬ 
mation  for  this  population.  All  the  statistics  present  highly  unimodal  distribu¬ 
tion  of  values  rather  than  uniform.  In  Table  4.3.1  we  reproduce  the  statistics 
given  for  the  number  of  engineers  per  responsibility  level.  The  data  describing 
the  professional  engineers  of  Ontario  is  a  t5^pical  example  of  the  numerous  large 
data  bases  containing  information  about  populations.  We  -will  use  the  statistics 
included  in  the  engineering  report  for  many  examples  in  various  sections  of  the 
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thesis. 


Table  4.3.1 

RES-LEVEL 

^ENGINEERS 

F 

1036 

E 

2853 

D 

4345 

C 

3240 

B 

1671 

A 

1355 

As  a  second  example,  consider  the  population  of  teachers  in  Canadian 
Universities.  A  report  by  Statistics  Canada  [Teachers  in  Universities  1978-1979] 
provides  information  about  teachers  in  Canadian  Universities.  The  attribute 
values  of  their  attributes  present  highly  non-uniform  distributions.  In  Table 
4.3.2,  we  reproduce  the  number  of  teachers  per  age  interval.  In  Table  4.3.3,  we 
reproduce  the  statistics  on  the  attribute  CITIZENSHIP.  As  a  last  example  of 
non-uniformity  of  attribute  values,  the  number  of  male  teachers  in  Canadian 
Universities  was  27799,  while  the  number  of  female  teachers  was  4846. 


Table  4.3.2 

AGE 

^TEACHERS 

<=  25 

13 

25-29 

1458 

30-34 

5074 

35-39 

7658 

40-44 

6213 

45-49 

4824 

50-54 

3477 

55-59 

2294 

60-over 

1526 

Independence  assumption 

There  are  two  important  reasons  why  the  independence  assumption  is  not 
realistic  in  estimating  record  selectivities.  The  first  is  that  the  attribute  values 
of  certain  attributes  are  correlated.  Correlations  are  measures  of  the  covari¬ 
ance  of  the  attribute  values  of  two:  attributes  [Papoulis  1965].  The  existence  of 
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Table 

4.3.3 

CITIZENSHIP 

TEACHERS 

CANADIAN 

24693 

USA 

4000 

UN.KINGD 

1561 

OTK.  COM. 

623 

BELG 

464 

OTHER  EUR. 

591 

OTH 

462 

high  correlations  between  attribute  values  in  data  bases  is  a  direct  consequence 
of  the  fact  that  files  often  contain  information  about  populations  of  the  real 
world.  The  second  reason  that  the  independence  assumption  is  unrealistic  is 
that  often  in  actual  data  base  environments  pairs  of  attributes  do  not  assume  all 
the  possible  combinations  of  their  values.  For  example  the  pair  of  attributes 
DEPARTMENT  and  PROJECT  may  have  many  combinations  of  their  attribute 
values  which  do  not  exist  in  the  data  base  because  one  department  may  be 
involved  with  only  a  small  number  of  projects  and  vice  versa.  We  will  call  this 
value  subsetting. 

To  show  the  relationship  between  the  assumptions  of  independence,  no¬ 
correlations  and  no-value  subsetting,  consider  two  attributes  X  and  Y,  which  we 
assume  take  on  distinct  attribute  values.  Let  f(x,y)  be  the  two-dimensional 
density  distribution  of  the  two  attributes  X  and  Y.  We  can  think  of  f  (x,y)  as 
being  a  two-dimensional  histogram.  We  say  that  the  attribute  values  of  the  attri¬ 
butes  X  and  Y  are  independent  if  f{x,y)  =  p  {x)q  (y),  where  p{x)  and  q{y)  are 
the  density  distributions  of  the  attributes  X  and  Y.  We  say  that  the  attribute 
values  of  the  attributes  X  and  Y  are  not-corr elated  if  E{xy)  —  E  {x)E  (y)  =  0, 
where  the  symbol  E  denotes  expectation  for  all  the  records  in  a  file.  We  say  that 
there  is  no-value  subsetting  if  n(x,y)  ^  0  for  every  pair  of  values  x  of  X  and  y  of 
rwithp(2:)  ^  0  and  q(y)  ^  0.  It  is  easily  verified  that  independence  implies  no 
correlations  and  no  value  subsetting  since 
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i  i 

=  TiTi^iVjpMqiyj) 

j  i 

=  T,y3^iyj)Ti^iPM 

J  i 

=  E{y  )E{x) 

and  • 

7i{xi,y^)  -  Nf(xi,yj) 

=  Ep  i^i)q  (vj) 

which  is  always  non-zero  if  ^(a:^)  ^  0  and  giy-)  ^  Q.  '7 

In  addition,  we  can  show  that  value  subsetting  does  not  necessarily  imply 
correlations  and  vice  versa.  Figure  4.3.1  presents  an  example  where  value  sub- 
setting  exists  but  correlations  do  not.  The  numbers  in  the  boxes  indicate  the 
number  of  records  in  the  file  which  have  the  corresponding  pair  of  attribute 
values.  Figure  4.3.2  presents  an  example  where  correlations  exist  but  not  value 
subsetting.  Figure  4.3.3  presents  an  example  where  there  are  no  correlations 
and  there  is  no  value  subsetting,  but  the  attributes  are  not  independent.  Thus 
the  independence  assumption  is  a  strong  one.  It  is  strictly  stronger  than  both 
assumptions  of  no  correlations  and  no  value  subsetting.  The  relationship 
between  value  subsetting  and  correlations  will  be  discussed  in  more  detail  in 
Chapter  6. 

Many  attribute  pairs  describing  populations  are  correlated.  Consider  again 
the  engineering  data  base.  In  this  data  base,  attribute  pairs  like  (SALARY, 
YEARS-FROM-GRADUATION),  (SALARY.  RESPONSIBILITY-LE\^L).  (YEARS-FROM- 
GRADUATION,  RESPONSIBILITY-LEVEL),  are  highly  correlated.  In  Table  4.3.4  we 
reproduce  the  relationship  between  5?’ears  from  degree  and  responsibility  level 
for  engineers  in  responsibility  level  "A",  and  for  the  total  population  of 
engineers.  Note  that  there  are  only  a  few  engineers  in  responsibility  level  A  with 
many  years  of  experience.  As  a  second  example  consider  the  population  of 
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Attribute  2 
values 


Figure  4.3.1 

Value  subsetting  but  no  correlations 


Attribute  2 
values 


Figure  4.3.2 

Correlations  but  no  value  subsetting 


Attribute  2 
values 


Figure  4.3.3 

Value  dependencies  but  no  correlations  or  value  subsetting 
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Table  4.3.4 

YEAR-OF-GRADUATION 

#ENG.  IN  LEVEL  A 

TOTAL  #ENG. 

49-72 

19 

3609 

73 

10 

566 

74 

13 

590 

75 

27 

552 

76 

66 

511 

77 

316 

494 

78 

521 

574 

79 

315 

322 

teachers  in  Universities.  The  attributes  (TITLE,  HIGKEST-EARNED-DEGREE), 
(AGE,  TITLE),  (AGE,  SALARY),  {SALARY,  TITLE),  (SALARY,  HIGHEST-EARNED- 
DEGREE),  (SEX,  TITLE),  (NATIONALITY,  COUNTRY-OF-HIGHEST-EARNED-DEGREE), 
are  highly  correlated.  In  Table  4.3.5  we  reproduce  the  relationship  between  age 
group,  sex  and  title,  for  assistant,  associate  and  full  professors.  Note  the  strong 
correlations  both  between  age  and  title,  and  between  sex  and  title. 


Table  4.3.5 

Age-group 

Full-Prof. 

Assoc-Prof. 

Assist. Prof. 

M 

F 

M 

F 

M 

F 

S25 

0 

0 

0 

0 

2 

1 

25-29 

0 

1 

27 

3 

548 

151 

30-34 

37 

0 

994 

102 

2363 

464 

35-39 

606 

14 

3568 

336 

1985 

445 

40-44 

1678 

48 

2785 

306 

734 

252 

45-49 

1960 

82 

1715 

260 

360 

175 

50-54 

1815 

95 

868 

215 

174 

123 

55-59 

1358 

92 

438 

132 

82 

75 

^60 

929 

79 

224 

92 

63 

32 

In  real  data  bases,  value  subsetting  is  often  encountered.  Functional 
dependencies  between  two  attributes  [Tsichritzis  and  Lochovsky  1977]  represent 
special  cases  of  value  subsetting.  As  an  example,  consider  the  attributes 
DEPARTMENT#  and  PROJECT#  of  a  PROJECT  file.  If  each  department  supports 
separate  projects,  the  DEPARTMENT#  functionally  depends  on  PROJECT#.  How¬ 
ever,  certain  departments  may  have  a  few  projects  in  common.  The  relationship 
between  the  two  attributes  is  then  no  longer  functional,  but  since  only  a  few  pro- 
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jects  are  common  between  departments,  only  a  proper  subset  of  the  Cartesian 
product  of  attribute  values  exists  in  the  data  base.  Some  more  examples  are: 
(DEP#,  DEGREE-KIND),  (SUPPLIERj^^,  PART#),  (PART#,  MACHINE#),  (TEACHER, 
COURSE). 

Random  placement  assumption 

The  random  placement  assumption  is  not  realistic  when  the  placement  of 
records  in  a  file,  according  to  some  order,  results  in  concentration  of  the 
records  with  a  certain  attribute  value  in  a  small  portion  of  the  blocks  of  the  file. 
Such  concentrations  may  result  from  various  causes.  One  reason  may  be  that 
the  attribute  values  depend  on  the  length  of  the  time  that  the  records  have  been 
in  the  data  base.  If  new  records  are  always  added  at  the  end  of  the  file  for  exam¬ 
ple,  the  time  dependence  of  attribute  values  vidll  force  records  with  certain 
attribute  values  to  the  end  of  the  file.  Another  reason  may  be  that  the  data  base 
is  clustered  on  a  certain  attribute.  In  this  case,  attribute  values  of  attributes 
that  are  correlated  or  value  dependent  on  the  clustering  attribute  may  be 
located  in  a  small  portion  of  the  file. 

Consider  a  file  describing  the  employees  of  an  organization.  Assume  that 
new  employee  records  are  given  a  new  ID  number  and  are  inserted  at  the  end  of 
the  file.  In  this  data  base,  the  attribute  values  of  the  attributes  ACE,  SALARY, 
and  RESPONSIBILITY-LEVEL  of  the  records  change  with  the  time.  Records  of 
employees  with  higher  age,  salary,  and  responsibility  level  would  tend  to  be 
located  at  the  beginning  of  the  file,  while  records  of  low  ages,  salaries,  and 
responsibility  levels  would  tend  to  be  located  at  the  end  of  the  file.  Thus,  the 
random  placement  assumption  is  not  satisfied  in  this  case. 

In  clustered  files,  value  dependencies  between  the  clustering  attribute  and 
other  attributes  in  the  file,  as  well  as  correlations,  are  important  for  the  estima- 
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tion  of  block  selectivities.  Consider  an  EMPLOYEE  data  base  clustered  on 
DEPARTMENT#  in  order  to  facilitate  frequent  joins.  Assume  that  only  one 
department  employs  physicists.  Therefore  a  query  asking  for  DEGREE-KIND  = 
"physics"  will  select  records  from  only  a  small  portion  of  the  blocks  of  the  file. 
This  will  result  in  a  considerably  different  number  of  block  accesses  than  what  is 
estimated  using  the  random  placement  assumption. 

Uniformity  and  independence  of  attribute  values  in  queries  assumption 

The  uniformity  of  attribute  values  in  queries  assumption  may  not  be 
satisfied  in  a  real  data  base  enviroment.  The  reason  is  that  certain  attribute 
values  may  be  of  more  interest  to  the  users  than  others.  For  example,  in  a  sta¬ 
tistical  type  of  environment,  users  may  be  more  interested  in  rare  cases  instead 
of  usual  ones.  Pezarro  [Pezarro  1981],  presents  a  performance  prediction  study 
in  a  real  data  base  environment  where  the  uniformity  of  attribute  values  in 
queries  assumption  was  found  unrealistic.  Magalhaes  [Magalhaes  1981]  has  also 
observed  that  in  a  large  actual  data  base  environment  user  queries  were  non- 
uniformly  distributed  over  all  the  attribute  values.  We  vhll  give  some  more 
specific  examples  in  the  next  chapter. 


4.4  Summary 

In  this  chapter  wc  described  several  selectivities  and  how  existing  models 
estimate  them.  We  showed  that  assumptions  used  by  these  models  are  not 
satisfied  in  many  common,  large  data  base  environments.  An  important  ques¬ 
tion  is  whether  data  base  performance  is  insensitive  to  these  assumptions.  If  so, 
the  modelling  of  data  base  contents,  data  placement  on  devices,  and  user- 
requests  by  using  these  assumptions  would  be  acceptable.  However,  as  we  show 
in  the  next  chapter,  this  is  not  the  case.  A  second  question  is,  what  are  the 
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implications  of  using  these  assumptions  on  the  estimated  perform.ance.  We  dis¬ 
cuss  this  question  in  the  next  chapter. 
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Chapter  5:  Consequences  of  the  Invalid  Assumptions 


We  have  seen  in  the  previous  chapter'  that  in  typical  large  data  base 
environments  the  assumptions  used  in  prior  studies  for  the  estimation  of  selec- 
tivities  are  not  satisfied.  In  this  chapter,  we  examine  the  implications  of  using 
these  assumptions  on  data  base  performance  evaluation.  We  show,  formally, 
that  often  these  assumptions  lead  to  pessimistic  estimations  of  the  perfor¬ 
mance.  We  give  several  examples  to  clarify  the  issues  and  to  demonstrate  that 
large  errors  in  the  estimated  performance  may  result.  Finally,  we  discuss  the 
implications  of  these  results  for  query  processing,  index  selection,  and  the 
implementation  of  data  base  performance  predictors. 


5.1  Implications  of  the  Assumptions 

In  this  section,  we  examine  the  impact  of  the  assumptions  of  uniformity  and 
independence  of  attribute  values  in  a  file,  random  placement,  and  uniformity  of 
attribute  values  in  queries  on  the  system  performance.  We  show  that  these 
assumptions  often  predict  an  upper  bound  of  the  expected  system  cost.  Thus, 
they  result  in  "pessimistic"  estimations  of  the  system’s  performance. 

We  wall  first  prove  that  the  random  placement  assumption  maximizes  the 
expected  cost.  We  assume  here  that  the  cost  is  measured  by  the  expected 
number  of  block  acces.ses  required  to  retrieve  the  records  qualifying  in  a  query 


-  40  - 


Q.  For  a  high  degree  of  multiprogramming  and  random  requests  of  data,  this 
number  will  be  approximated  by  the  block  selectivitiy  of  a  query  Q.  Let  n  be  the 
number  of  records  of  the  file  qualifying  in  the  query  Q.  The  problem  of  finding 
the  expected  number  of  blocks  of  the  file  containing  the  n  qualifying  records 
can  be  modelled  as  a  selection  problem  ([Cardenas  1975], [Yao  1977]).  The  cost 
function  that  we  consider  here  is: 

c(PiPi  =  5.1.1 

i=l 

where  7®  is  the  index  cost,  is  the  probability  that  an  arbitrary  record  to  be 
retrieved  is  located  in  block  i,  and  M  is  the  number  of  blocks  in  the  file.  In  this 
cost  function,  is  the  probability  that  block  i  will  not  be  selected  in  n 

record  retrievals  from  the  file.  This  function  is  a  generalization  of  Cardenas’ 
function  that  takes  into  account  non-uniform  likelihoods  of  the  blocks  of  the  file 
to  conteiin  records  qualifying  in  a  query.  This  function  (as  well  as  Cardenas’) 
assumes  replacement  of  the  retrieved  records. 

Theorem  5.1.1:  The  expected  cost  5.1.1  is  maximized  for  uniform 

Proof:  We  will  show  that  C  as  given  by  5.1.1,  subject  to  the  constraint 

M 

Y^P°  =  1,  is  maximized  for  P^  uniform  over  all  the  blocks  of  the  file.  To  do  this 

t=i 

we  use  the  method  of  Lagrange  multipliers.  The  Lagrangian  of  the  problem  is: 

i=l  i=l 

where  X  is  a  Lagrange  multiplier.  We  noxv  solve  the  system  of  equations: 


=  0  for  7  =  1, 


K 


5.1.2 


=  1  5.1.3 

i  =  l 

Since  the  cost  of  accessing  the  index  blocks  does  not  depend  on  the  placement 
of  the  qualifying  records  among  the  blocks  of  the  file,  7^  does  not  depend  on  P?- 
The  partial  derivatives  of  5.1.2  give  the  system  of  equations 
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=  0,  for  i  =  l.  ‘  -  M 

The  solution  of  this  system  of  equations  is.  for  all  v. 


71 

M 

The  parameter  X  can  be  determined  from  the  constrsiint  =  1.  We  obtain: 

<=i 


Substituting  in  5.1.3  we  find  that  the  cost  function  5.1.1  has  an  extreme  value  for 
The  Hessian  matrix  of  the  problem  is  negative  definite  since: 

=  -n(n-l)(l-p9)’>-2<  0 

and 


dP^dPf 

for  all  i,j.  Therefore,  the  cost  function  5.1.1  is  maximized  for  P^  =  1/M.  We 
have  shown  then,  that  from  all  the  possible  likelihood  density  functions,  P^,  the 
uniform  one  maximizes  the  expected  system  cost  5.1.1.  V 

More  detailed  estimations  of  the  system  cost  that  do  not  rely  on  a  replace¬ 
ment  assumption,  as  well  as  a  method  for  finding  P^  for  a  given  query  Q  are 
given  in  Chapter  7.  The  experimental  results  shown  there  indicate  that  the 
error  introduced  by  the  replacement  assumption  is  small  in  comparison  to  the 
error  introduced  by  the  random  placement  assumption,  when  this  assumption  is 
not  satisfied.  The  previous  theorem  generalizes  for  the  approximation  of  the 
block  selectivity  as  well  as  for  the  bounds  of  the  block  selectivity  which  are 
presented  in  Appendix  4,'*‘ 

The  difference  in  the  expected  system  cost  between  uniform  P.p  and  other 
likelihood  density  functions  can  be  considerable  as  we  will  show  with  examples 


+  Another  proof  of  tliis  theorem  can  be  given  by  using  properties  of  Schur  hmctions,  We  vhU 
introduce  Schur  functions  later. 


-  4-2  - 


later.  Thus,  relational  query  optimizers  using  the  random  placement  assump¬ 
tion  will  tend  to  choose  an  expensive  sequential  scan  of  the  file  more  often  than 
is  actually  desirable.  In  the  index  selection  problem,  this  result  implies  that 
certain  attributes  of  a  file,  although  profitable  for  indexing,  may  not  be  selected 
for  indexing  by  a  model  that  uses  the  random  placement  assumption  for 
estimating  the  cost  of  queries.  In  both  problems,  the  average  system  cost  could 
be  reduced  by  taking  into  account  higher  moments  of  instead  of  just  the 
mean.  Thus,  this  result  has  practical  as  well  as  theoretical  significance. 

In  clustered  files,  one  way  to  provide  more  detailed  estimations  of  is  to 
take  into  account  the  dependencies  of  the  data  values  of  the  records  in  the  file. 
Assume  that  the  clustering  attribute  A  of  a  given  file  has  va  distinct  attribute 
values,  and  that  Mi  blocks  of  the  file  contain  records  with  attribute  A  values 


equal  to  Consider  a  restriction  B—'h'  on  a  non-clustering  attri- 

i  =  l 

bute  B.  An  approximation  of  the  system  cost  is  given  by 


1  viV*P(/l  =  a,r\B  =  b) 


) 


5.1.^ 


1  =  1 


where  N*P{A  =  =  5 )  is  the  expected  number  of  records  in  the  file  that 


have  attribute  A  value  equal  to  and  attribute  B  value  equal  to  b.  (It  is 
assumed  that  Mi  ^  1  for  every  i.)  This  approximation  takes  into  account  that 
records  with  attribute  B  value  equal  to  b  may  be  more  frequent  in  the  blocks  of 
the  file  with  certain  attribute  A  values.  We  will  describe  in  more  detail  the  rea¬ 
son  for  using  this  method  for  estimating  block  selectivities,  as  well  as  its  appli¬ 
cations,  in  Chapters  7  and  8.  However,  this  cost  function  approximates  the  cost 
estimated  by  the  random  selection  assumption  if  the  attribute  values  of  the 
attributes  A  and  B  are  assumed  to  be  independent: 


c  = 

i  =  l 


1  s^N*P(A  =b)^ 


Mi 
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1 


N*P{A  =  a.)-P(a  s  6) 


) 


-Z;  iV  ^  il/«P(Z=ai)^  ^ 


•1=1 


=  X;i^i(i-(i-Z)'’'‘'’<®’'’') 
1  =  1 


M 


This  is  the  cost  estimated  by  the  random  placement  assumption.  Thus  when 
independence  's  assumed  among  the  attributes  of  a  file,  the  cost  function  5.1.4 
also  produces  a  high  cost  estimation,  and  the  dependencies  among  attributes 
cannot  be  exploited  to  reduce  the  overall  system  cost. 

We  will  next  discuss  the  implications  on  the  average  system  cost  of  the 
assumption  of  uniformity  and  independence  of  attribute  values  in  a  file.  Let  F  be 
a  file  having  an  attribute  A  with  distinct  attribute  values.  The  average  cost  of 
queries  on  attribute  A  is  given  by 


M 


T]  (  T]  ( 1  -  (1 5.1.5 

i=l  i=i  y=i 

where  is  the  probability  that  a  query  on  the  ith  value  of  attribute  A  is  asked, 
P  is  the  cost  of  retrieving  the  index  blocks  for  the  “ith  value  of  A,  and  is  the 


number  of  records  of  the  file  that  have  the  fth  value  of  attribute  A. 


We  make  the  following  assumptions: 

Assum-ption  1 :  Queries  on  attribute  A  are  uniformly  distributed  over  all  the 
attribute  values  of  A. 

Assumption  2:  The  records  of  F  that  qualify  in  a  query  on  attribute  A  are 
randomly  selected  from  the  file. 

Assumptions  (l)  and  (2)  may  not  be  satisfied  in  many  data  base  environ¬ 
ments.  However,  in  some  database  environments,  they  may  be  realistic.  We  will 
prove  that  the  uniformity  of  attribute  values  assumption  is  pessimistic  when 
these  two  assumptions  are  satisfied,  and  we  will  discuss  the  implications  of 
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relaxing  these  assumptions  later. 

Theorem  5. 1.2:  The  average  cost  5.1.5  is  maximized  for  uniform  distribution 
of  attribute  values  of  attribute  A  under  assumptions  (1)  and  (2). 

Proof:  Under  assumptions  (1)  and  (2)  the  system  cost  5.1.5  becomes 


c{nx, 712.  •  •  •  — 

^  i  =  l  'f^A  1=1  ^ 


5.1.6 


We  will  show,  using  the  method  of  Lagrange  multipliers,  that  C.  as  given  by  5.1.6 

and  subject  to  the  constraint  ^rii  =  N,  where  N  is  the  number  of  records  in  the 

i=l 

file,  is  maximized  for  uniform  74  over  all  the  values  of  attribute  A.  The 
Lagrangian  of  the  problem  is 

^A  i=l  1=1  1  =  1 

where  A  is  a  Lagrange  multiplier.  We  now  solve  the  system  of  equations: 


=  0  for  i  =  l,  .VJ  5.1.7 

071^  Vji  ita 

^A 

1  =  1 
TLi 

The  index  cost  is  given  by  P=NL  where  NL  is  the  number  of  levels  of  the 

Hr  1 


index  and  BFI  is  the  blocking  factor  of  the  pointers  to  records  which  are  stored 


1  ^  ■ 

in  the  lowest  level  of  the  tree.  Thus  - y.P^NL  A - ■  .  which  is  independent 

'^A  v^BFI 


N 


of  94.  The  partial  derivatives  of  5.1.7  give  the  system  of  equations 


■M 


va  M  M 

The  solution  of  this  system  of  equations  is,  for  all  z: 


\V: 


Toi  = 


ln[ - 

In(X-jj) 


Substituting  into  the  constraint  ^  74  =  iV  we  obtain 

1=1 


5.1.8 
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i7l[- 


Xvji 


] 


=  N 


or 


in[- 


Xva 


in(l-i) 


■] 


JL 


Thus  from  5.1.8  we  find  that  the  cost  function  5.1.6  has  an  extreme  value  for 
N 

rij  =  — ,  The  Hessian  matrix  of  the  problem  is  negative  definite  since 
'iJA 


d^L 

dn^ 


—M,^  1  xn,/,  1  "«2 


vP^-p 


and 


dnidn^i 

N 

for  all  i,j.  Therefore,  the  cost  function  5.1.6  is  maximized  for  7^^  =  - .  V 

We  have  shown  then,  that  under  assumptions  (1)  and  (2),  the  uniformity  of 
attribute  values  of  attribute  A  is  a  pessimistic  assumption.  This  is  a  somewhat 
unexpected  result,  since,  if  the  system  cost  were  measured  by  the  expected 
number  of  records  qualifying  in  queries  on  attribute  A,  the  cost  would  be 
independent  of  the  distribution  of  the  attribute  values  of  attribute  A  under 
assumption  (l).  This  assumption  has  been  used  extensively  in  data  base  perfor¬ 
mance  evaluation  studies.  The  reason  that  this  result  holds  is  that  the  cost 
function  is  concave  with  respect  to  the  number  of  qualifying  records. 

An  important  practical  consequence  of  this  result  is  that  between  two  attri¬ 
butes  that  are  candidates  for  indexing,  it  may  be  more  profitable  to  index  the 
one  with  non-uniform  attribute  values.  (Of  course  other  factors  affect  the 
choice  also.)  Thus,  non-uniformity  of  attribute  values  could  be  exploited  to 
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reduce  the  expected  system  cost. 


We  will  show  next  a  stronger  result  that  says  that  as  the  attribute  values 
deviate  more  from  uniformity,  the  expected  cost  will  decrease  (assuming  unifor¬ 
mity  of  queries).  To  compare  deviation  from  uniformity  of  two  given  distribu¬ 
tions,  the  concept  of  Tn.ajorizatio7i  has  been  used  [Marshall  and  Olkin  1979].  A 

vector  a  =  (ai, a2.  '  ‘  ‘  .0^)  with  ai  i  a2  ^  ‘  1  an  is  said  to  majorize  a  vector 

k  k 

b  =  (6  1.  &  a  ■  ■  ■  .  bn )  with  6  1  ^  b  a  ^  ‘  ^  bn  if  2  ^  ^  bt  for  fc  =  1  to  n  - 1.  and 

1=1  1=1 

n  n 

^  Ot  =  Intuitively,  if  a  vector,  a,  majorizes  another  vector,  b,  then  a  devi- 

1=1  1=1 

ates  more  from  uniformity  than  b  does.  A  powerful  theorem  associates  majori- 
zation  with  a  class  of  functions  called  Schur  concave  functions.  A  function  f  of  n 
real  variables  is  called  Schur  concave  if  for  every  pair  i  ^  j, 


Theorem  5.1.3  (Schur):  If  /  is  a  Schur  concave  function  and  a  majorizes  b, 
then  /  (a)  ^  /(b). 


A  proof  of  this  theorem  appears  in  [Marshall  and  Olkin  1979].  Applications 
of  this  theorem  in  optimization  problems  have  appeared  in  [Wong  1980],  [Parker 
1980],  and  [Aleliunas  and  Suwanda  1981]. 

Two  important  functions  that  are  Schur  concave  are  the  entropy  and  the 
second  moment  about  the  origin.  The  entropy  of  the  attribute  vedues  of  an  attri- 

bute  A  is  defined  as  AT  =  -  2j  where  N  is  the  number  of  records  in  the 

i=i  Av  Jy 

file,  7ii  is  the  number  of  values  with  value  for  attribute  A,  and  Vj^  is  the  total 
number  of  distinct  values  of  A.  The  entropy  is  a  Schur  concave  function  because 


(rii  -  nj)i 


dH 

drii 


dH  . 

dUj- 
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-  nj)log^  g  0 

N  rii 
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The  second,  moment  of  a  distribution  ordered  by  decreasing  frequency  of  values 

1 

is  defined  as  ^  Schur  concave  function  because 

^  i  =  l 


(Jh  -  nj){ 


dS^  dS^  '’•2  '■•2 


)  =  (7ii  -  nj  )(47  -  4r)  ^  0 


dui  drij  ^  N  N 

We  will  show  next  that  our  cost  function  5.1.6  is  also  Schur  concave. 

Theorem  5.1.4:  The  cost  function  C{ni,n2,  '  •  ■  given  by  5.1.6  is  Schur 

concave.  Therefore,  if  a  majorizes  b,  then  C(a)  ^  C’(b). 

Proof: 


-  (1  -  i)”*) 


Then 


{rii  -  7ij){ 
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los(l  -  ^)(ni  -  ">)((!  -  -IfP  -  (1  -  S 


Vi  '  M’'  '  M 

Thus,  C  as  given  by  5.1.6  is  Schur  concave.  V 


0 


The  cost  function  C  is  symmetric  with  respect  to  n  1,713,  ■  ■  •  ,7t,,^.  Therefore, 
we  can  always  assume  that  rii^ng^  •  •  •  sind  examine  if  one  vector  a  of  attri¬ 

bute  value  frequencies  majorizes  another  vector  b  of  attribute  value  frequen¬ 
cies.  As  a  result  of  Schur’s  theorem,  if  a  and  b  are  frequencies  of  the  attribute 
values  of  two  attributes  A  and  B  with  the  same  number  of  attribute  values  such 
that  a  majorizes  b,  and  the  queries  are  uniformly  distributed  over  the  attribute 
values  of  A  and  B,  then  the  entropy  of  the  attribute  values,  their  second 
moment,  and  the  expected  cost  per  query  is  less  for  attribute  A  than  for  attri¬ 
bute  B.  Thus,  this  discussion  relates  the  ideas  "deviation  from  uniformity", 
"majorization",  "increased  entropy",  "increased  second  moment",  and 


"increased  cost". 
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The  above  result  can  be  extended  for  queries  involving  more  than  one  attri¬ 
bute. 

Theorem  5.1.5:  The  uniformity  and  independence  of  attribute  values 
assumption  results  in  pessimistic  estimations  of  the  expected  system  cost  under 
assumptions  (1)  and  (8). 

Proof:  Consider  two  attributes  and  Ag  with  Vi  and  Vg  attribute  values 
respectively.  Let  Qi  be  a  conjunctive  query  on  a  combination  of  attribute  values 
of  A  1  and  A 2-  Qi  selects  a  distinct  subset  of  records  from  the  file,  such  that 

Y]  Tii  N.  Since  queries  are  assumed  to  be  uniformly  distributed  on  every 
1=1 

attribute  value  combination  of  the  two  attributes,  the  probability  of  Qj  is 
— .  Therefore  this  problem  can  be  formulated  exactly  as  in  the  proof  of 

Theorem  5.1.2.  The  cost  is  maximized  for  uniform  rii,  which  occurs  for  uniform 
and  independent  values  of  A  i  andAg.  The  formulalion  of  the  problem  is  similar 
for  disjunoLive  queries  also.  (The  constraint  in  this  case  will  be 

rii  =  (yi  +  V2-  1)N. )  V 

i  =  l 

We  have  shown  that  the  uniformity  and  independence  of  attribute  values 
assumption  results  in  pessimistic  estimations  of  the  expected  system  cost  for 
uniformly  distributed  conjunctive  or  disjunctive  queries  and  randomly  placed 
records.  This  result  implies  that  if  we  have  selected  an  attribute  A  for  indexing, 
then,  provided  that  the  assumptions  of  the  theorem  hold,  another  good  candi¬ 
date  for  indexing  is  an  attribute  B  highly  correlated  with  A.  While  the  expected 
number  of  records  qualifying  in  a  conjunctive  query  on  A  and  B  will  be  the  same, 
whether  B  is  correlated  with  A  or  not  (and  thus  the  information  revealed  by  its 
existence  is  the  same),  the  average  cost  of  accessing  the  qualifying  records  will 


be  less  if  B  is  correlated  with  A. 
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We  can  generalize  the  previous  theorems  by  relaxing  assumption  2. 

Theorem  5.1.6:  For  uniform  queries  the  average  sytem  cost  5.1.5  is  maxim¬ 
ized  under  the  random  placement  and  the  uniformity  of  attribute  values 
assumptions. 

Proof:  For  uniform  queries  the  cost  5.1.5  becomes: 


^(ni,  ■  •  •  ,ny^,Pl.  •  •  • 

^  C(ni,  •  •  •  •  ■  ■  ,^) 


because  of  Theorem  5.1.1, 


^  C(-^  •  •  •  ~ — —  •  •  ■  — ) 

~  'va’M'  ’ 

because  of  Theorem  5.1.2.  Therefore,  the  cost  when  the  queries  are  uniformly 
distributed  over  the  attribute  values  is  less  than  what  is  predicted  under  the 
uniformity,  independence,  and  random  placement  assumptions.  V 

We  now  discuss  the  assumption  of  the  uniformity  of  attribute  values  in 
queries.  It  is  our  conjecture  that  the  uniformity  of  attribute  values  in  queries  is 
often  a  pessimistic  assumption.  One  reason  for  this  is  that  the  users  may  be 
more  interested  in  rare  cases  than  in  usual  ones.  For  example  managers  may  be 
more  interested  in  the  employees  that  have  higher  salaries  than  in  the  employ¬ 
ees  with  average  salaries.  A  second  reason  is  that  in  a  selection  type  enviroment 
(like  libraries  or  security  surveillance  environments)  users  try  to  specify  items 
with  high  discriminatory  power  instead  of  usual  ones.  The  larger  the  data  base 
the  more  important  this  factor  is  because  users  do  not  want  answers  consisting 
of  too  many  records.  Finally  where  unknown  values  of  an  attribute  exist, 
queries  are  likely  to  specify  one  of  the  remaining  attributes.  Since  often  unk¬ 
nown  values  involve  a  large  proportion  of  records,  queries  are  directed  to  values 
involving  smaller  proportions.  It  is  possible  therefore  that  the  uniformity  of 
attribute  values  in  queries  may  tend  to  be  a  pessimistic  assumption. 
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For  many  realistic  data  base  environments,  therefore,  models  using  all 
these  assumptions  tend  to  provide  pessimistic  estimations  of  the  cost.  This 
implies  that  attributes  that  are  not  profitable  for  indexing  or  linking  in  a  model 
that  uses  these  assumptions  may  be  found  profitable  when  a  more  detailed 
model  is  used.  Therefore,  the  system  cost  could  be  reduced  by  exploiting  non- 
random  placement  and  variances  and  covariances  of  the  attribute  values  in  the 
file,  as  well  as  the  non-uniformit}''  of  attribute  values  in  queries. 


We  will  next  consider  multifile  environments.  A  general  type  of  query  in 
these  environments  may  require  operations  on  individual  files  (like  restrictions 
and  projections)  followed  by  joins  of  individual  files,  Wc  ’ivill  use  a  different  cost 
criterion,  one  which  is  often  important  in  query  evaluation.  This  is  the  number 
of  distinct  attribute  values  that  remain  after  a  restriction  on  another  attribute. 
This  number  may  determine,  for  example,  the  strategy  that  we  will  use  for 
accessing  the  records  of  the  second  file  that  participates  in  a  join.  If  only  a  few 
values  of  the  joining  attribute  are  selected  by  the  restriction,  we  may  want  to 
use  the  index  on  the  joining  attribute  to  access  the  second  file.  If  a  large 
number  of  values  qualify,  we  may  want  to  sort  the  file  first  [Yao  and  De  Jong 
1979],  If  semi-joins  are  used  as  a  means  of  processing  joins,  we  may  want  to 
estimate  the  number  of  remaining  values  in  the  joining  attribute  to  decide  about 
the  order  of  semi-join  processing  or  the  profitability  of  cyclic  semi-joins  [Bern¬ 
stein  and  Chiu  1981].  The  problem  of  finding  the  expected  number  of  values  of  A 
that  remain  after  restrictions  on  an  attribute  B  can  be  formulated  as  a  selection 
problem  ([Yao  and  De  Jong  1978],  [Goodman  et  al.  1979]).  The  number  of  values 
of  A  selected  by  a  restriction  on  B  is  given  by 


C  =  V(1_(1-P,,^) 


5.1.9 


i  =  l 

where  Npj  is  the  expected  number  of  records  of  the  file  qualifying  in  a  restric¬ 
tion  on  the  value  j  of  the  attribute  B,  Piu  is  the  probability  that  the  value  i  of  A 
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appears  in  the  records  selected  by  the  restriction  on  the  value  j  of  attribute  B, 
and  is  the  number  of  values  of  attribute  A. 

Theor27n  5. 1.7:  For  a  given  value  of  j  the  cost  C  given  by  5.1.9  is  maximized 
for  Pi\j  uniform  and  independent  of  j.  Moreover,  if  the  queries  on  B  are  uni¬ 
formly  distributed  over  all  its  values,  the  average  cost  is  maximized  under  the 
uniformity  and  independence  assumptions. 

Proof:  For  a  given  value  j  of  B  we  consider  pj  to  be  constant.  The  values  of 

obey  the  constraint  ~  shown  as  before  that 

i  =  l 

C{P ^j,P2\j,  ■  ■  ■  >Pv^\j)  maximized  for  uniform  over  all  the  values  V4 

of  A  (and  therefore  the  distribution  Pi\j  that  maximizes  the  cost  function  is  also 
independent  of  the  value  j).  Assuming  uniformity  of  queries  on  attribute  B  we 
can  show  in  a  way  similar  to  Theorem  5.1.6  that  the  uniformity  and  indepen¬ 
dence  of  attribute  values  of  A  and  B  maximizes  this  cost.  V 

This  resiilt  implies  that  models  using  the  uniformity  and  independence  of 
attribute  values  assumption  may  utilize  expensive  pessimistic  join  strategies 
(like  sequential  scan  or  sorting)  more  often  than  necessary.  In  distributed 
environments  where  the  semi -join  is  often  a  profitable  method,  the  assumption 
of  the  current  optimizers  is  that  ail  restrictions  are  processed  first  and  then  the 
strategy  is  decided.  This,  however,  could  introduce  delays.  On  the  other  hand, 
the  uniformity  and  independence  assumption  may  hide  certain,  more  profitable, 
strategies.  Dependencies  of  attribute  values  can  also  be  exploited  in  other  types 
of  multifile  queries  (like  selections  followed  by  joins)  to  find  the  least  expensive 
strategy.  The  choice  of  the  optimal  query  evaluation  strategy  for  various  corre¬ 
lations  of  atti'ibute  values  is  discussed  in  Chapter  8. 
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5.2  Examples 

In  this  section  we  present  several  examples  to  clarify  the  previous  results. 
The  examples  are  selected  to  illustrate  the  points  of  the  previous  section  and  to 
demonstrate  that  large  errors  in  the  estimated  performance  may  result  when 
the  assumptions  are  not  satisfied.  Two  data  bases  are  used  in  these  examples. 
The  engineering  data  base  [Engineering  report  1979],  which  is  assumed  to  have  a 
blocking  factor  of  50  records  per  block,  and  a  hypothetical  file  F  of  100000 
records  and  a  blocking  factor  of  20  records  per  block. 

We  will  first  discuss  the  implications  of  the  random  placement  assumption 
when  it  is  not  satisfied.  Consider  an  example  from  the  engineering  data  base. 
Assume  that  the  file  is  clustered  according  to  the  YEAR-OF-GRADUATION. 
(Almost  the  same  ordering  is  obtained  if  new  engineer  records  are  simply 
inserted  at  the  end  of  the  file.)  Due  to  the  correlations  of  SALARY  and 
RESPONSIBILITY-LEVEL  v/ith  YEAR-OF-GRADUATION,  the  probability  that  records 
that  qualify  in  queries  on  the  attributes  SALARY  and  RESPONSIBILITY-LEVEL  is 
not  uniform  over  the  blocks  of  the  file.  A  maximum  number  of  blocks  that  con¬ 
tain  qualifying  records  for  a  given  blocking  factor,  can  be  estimated  from  the 
statistics  given  in  the  engineering  report  (similar  to  Table  4.3.4).  For  a  blocking 
factor  of  50  records  per  block,  a  query  asking  for  the  records  of  the  file  with 
responsibility  level  A  will  retrieve  at  maximum  108  blocks.  However,  under  the 
random  placement  assumption,  virtually  all  the  blocks  of  the  file  have  to  be 
retrieved  (287  blocks).  The  large  relative  error  is  due  to  the  fact  that  the  ran¬ 
dom  placement  assumption  is  unrealistic  in  this  case.  From  a  query  processing 
point  of  view,  the  small  number  of  blocks  containing  qualifying  records  suggests 
the  use  of  an  index  for  accessing  them  instead  of  a  sequential  scan.  From  an 
index  selection  point  of  view,  the  high  correlations  of  these  attributes  with  the 
clustering  attribute  of  the  file  results  in  highly  non-uniform  placement  of  the 
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qualifying  records  among  the  blocks  of  the  file.  Thus  these  attributes  become 
profitable  for  indexing. 

To  demonstrate  the  effect  of  value  dependencies  on  estimated  perfor¬ 
mance,  assume  that  the  file  F  contains  information  on  employees  and  that  it  is 
clustered  on  DEPT#.  Under  a  blocking  factor  of  20,  the  records  of  the  file 
require  in  500  blocks.  Assume  that  there  are  10  departments  with  1000  employ¬ 
ees  each,  and  that  physicists  are  only  employed  by  the  department  of  engineer¬ 
ing.  Assume  that  there  are  500  physicists.  Under  the  i-andorn  placement 
assumption,  the  query  "retrieve  the  employees  with  a  degree  in  ph5''sics"  is  going 
to  retrieve  approximately  300  blocks.  However,  the  actual  number  of  blocks  of 
the  file  containing  qualifying  records  is  50.  The  difference  in  the  estimated  per¬ 
formance  is  due  to  the  value  dependencies  between  the  clustering  attribute  and 
the  attribute  DEGREE. 

The  uniformity  and  independence  assumptions  may  also  result  in  large 
errors.  Consider  two  examples  from  the  engineering  data  base.  A  query  asking 
for  the  engineers  in  responsibility  level  "F"  is  expected  under  the  uniformity 
assumption  to  retrieve  2416  records.  However,  only  1038  engineers  are  in 
responsibility  level  "F".  The  relative  error  in  the  estimation  is  high.  As  a  second 
example  consider  the  query;  "responsibility  level  A  and  more  than  15  years  from 
degree".  The  uniformity  and  independence  assumptions  lead  to  the  conclusion 
that  1470  records  qualify  in  the  query  instead  of  the  actual  6.  In  this  case,  the 
estimation  of  record  seiectivities  based  on  the  independence  assumption  leads 
to  a  large  error.  Using  the  actual  distributions  of  responsibility  level  and  years 
from  graduation,  but  assuming  independence  of  attribute  values  between  the 
two  attributes,  the  expected  number  of  records  qualifying  in  the  query  is 
estimated  to  be  1300  instead  of  the  actual  6.  The  relative  error  is  again  very 
high,  despite  avoiding  the  assumption  of  uniformitj?-.  These  discrepancies 


between  the  actuai  and  the  estimated  number  of  records  that  qualify  in  a  query 
can  lead  to  wrong  estimations  of  the  expected  cost  of  a  query  evaluation  stra¬ 
tegy  in  a  relational  data  base  environment,  with  the  result  that  the  optimizer 
may  choose  a  non-op  Lima!  strategy  for  procesing  the  query. 

To  demonstrate  the  effect  of  the  uniformity  of  attribute  values  assumption 
in  estimating  the  performance  of  data  bases,  consider  again  the  file  F.  Let  A  be 
an  attribute  of  the  file  with  20  distinct  attribute  values.  We  assume  that  the 
records  qualifying  in  queries  on  A  are  randomly  placed  in  the  blocks  of  the  file 
and  that  queries  are  uniformly  distributed  over  all  the  attribute  values  of  A.  If 
the  records  of  the  file  have  attribute  values  that  are  uniformly  distributed  over 
all  the  attribute  values  of  A,  then  the  average  cost  per  query  can  be  estimated 
as  3160  blocks.  On  the  other  hand,  if  one  value  of  A  accounts  for  40%  of  the 
records  of  the  file,  while  the  remaining  records  are  uniformly  distributed  over 
the  other  values,  then  the  average  cost  is  estimated  to  be  2473^ blocks.  If  two 
values  of  A  account  for  80%  of  the  records  in  the  file  and  the  remaining  records 
are  uniform.!)''  distributed  over  the  other  attribute  values,  then  the  average  cost 
is  estimated  to  be  1396  blocks.  These  figures  indicate  that  the  system  cost  is 
sensitive  to  the  distribution  of  the  records  of  the  file  over  the  attribute  values, 
and  that  the  uniformity  assumption  results  in  overestimations  of  the  data  base 
cost. 

We  have  shown  that  the  cost  is  a  Schur  concave  function.  Therefore,  assum¬ 
ing  that  the  queries  are  uniformly  distributed  over  the  attribute  values,  the  cost 
of  the  queries  on  an  attribute  A,  if  its  attribute  values  follow  the  distribution 
shown  in  figure  5.2.1,  is  less  than  the  cost  of  queries  on  this  attribute  if  its  attri¬ 
bute  values  follow  the  distribution  Similar  examples  can  be  given  for  the 

discrepancies  in  the  estimated  performance  introduced  by  the  existence  of 
correlations.  From  an  index  selection  point  of  view,  these  figures  show  that  the 
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Record 

frequencies 


Figure  5.2.1 

Comparison  of  costs  for  two  attributes 

profitability  of  an  attribute  for  indexing  depends  on  the  distribution  of  attribute 
values  (as  well  as  other  factors).  If  the  qualifying  records  are  not  randomly 
placed  in  the  file  even  more  significant  discrepancies  between  estimated  and 
actual  performance  will  appear. 

To  demonstrate  the  significance  of  the  non-uniformity  of  queries  in  the  esti¬ 
mation  of  the  system  cost  consider  again  the  example  of  attribute  A  with  two 
attribute  values  involving  80%  of  the  records  of  the  file,  and  the  other  IS  values 
involving  an  equal  proportion  of  records  each.  Consider  the  two  extremes:  1) 
Almost  all  queries  ask  for  one  of  the  two  values  of  A  that  involve  80%  of  the 
records.  In  this  case,  the  average  cost  per  query  on  attribute  A  is  5000  blocks. 
2)  Almost  all  queries  on  A  ask  for  one  of  the  18  other  values.  In  this  case  the 
average  cost  per  query  is  1  100  blocks.  Thus  the  estimated  performance  and  the 
profitability  of  attribute  A  for  indexing  depends  on  the  distribution  of  queries  on 
the  attribute  values. 


Finally,  we  will  comment  on  the  usefulness  of  the  uniformity  and  indepen- 


dence  assumption  model  for  estimating  the  expected  number  of  values  of  the 
joining  attribute  that  will  be  selected  by  a  restriction  on  another  attribute.  For 
a  reasonable  number  of  values  in  the  joining  attribute,  the  model  will  predict 
that  all  the  attribute  values  of  the  joining  attribute  will  be  selected.  As  an  exam¬ 
ple  consider  two  attributes  A  and  B  of  the  file  F  with  1^0  attribute  values  each. 
Assume  an  equality  restriction  on  A  followed  by  a  join  on  B  with  another  file  F‘. 
The  number  of  records  qualifying  for  the  restriction  on  A  is  100000/20  =  5000 
records  with  the  uniformity  assumption.  Using  the  independence  assumption  we 
conclude  that  all  the  attribute  values  of  B  appear  in  these  records.  This,  in 
many  cases,  would  indicate  that  an  appropriate  strateg}?^  for  accessing  the 
records  of  the  file  F‘  that  qualify  in  the  join  is  an  expensive  sort  of  the  file  F‘ 
according  to  the  values  of  the  joining  attribute.  For  example  if  F‘  has  81  blocks 
of  records  and  3  buffers  are  available  for  sorting,  the  sort  cost  will  be 
2*81*^0^381  =  8*81  =  648  block  accesses  (assuming  a  three  way  merge  sort). 
This  is  preferable  to  sequentially  accessing  the  file  F‘  for  each  value  of  the  join¬ 
ing  attribute  which  will  cost  20*81  =  1620  block  accesses.  However,  if  the  values 
of  the  attributes  A  and  B  of  the  file  F  are  not  independent,  it  may  be  the  case 
that  the  5000  records  of  F  that  qualify  in  the  restriction  on  A  contain  only  one 
distinct  value  of  B.  In  this  case  a  profitable  strategy  for  accessing  the  records  of 
F‘  is  sequential  scan  which  will  cost  only  81  block  accesses  in  comparison  to  the 
643  of  the  sort.  Thus,  in  this  example,  the  most  profitable  strategy  cannot  be 
found  if  a  model  based  on  the  uniformity  and  independence  assumptions  is  used. 
The  only  way  to  avoid  potentially  large  errors  in  estimates  is  to  keep  additional 
information  describing  the  dependencies  of  the  I'cstriction  attribute  with  the 
joining  attribute. 
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5.3  Summary 

In  Chapter  4,  we  provided  evidence  that  the  assumptions  of  uniformity  and 
independence  of  attribute  values,  random  placement  of  qualifying  records 
among  the  blocks  of  a  file,  and  uniformity  of  queries  are  often  not  satisfied  in 
actual  data  base  environments.  Wc  have  shown  in  this  chapter  that  when  these 
assumptions  are  not  satisfied,  large  errors  in  the  estimation  of  performance 
may  result.  We  have  also  proven  that  these  assumptions  often  lead  to  pessimis¬ 
tic  estimations  of  the  data  base  cost. 

These  results  have  some  important  implications  for  data  base  performance. 
For  the  implementation  of  data  base  design  performance  predictors,  they  sug¬ 
gest  that  large  errors  may  be  introduced  by  using  these  assumptions.  For  rela¬ 
tional  query  optimization,  these  results  imply  that  optimizers  based  on  these 
assumptions  will  choose  worst  case  strategies  (like  sequential  scan  or  sorting) 
more  often  than  necessary.  For  data  base  design,  these  results  imply  that  the 
exploitation  of  non-uniformity  and  dependencies  of  attribute  values,  non-random 
placement  of  qualifying  records  in  the  blocks  of  a  file,  and  non-uniformity  of 
queries  could  reduce  the  overall  system  cost. 

Also,  there  are  some  important  consequences  of  these  results  for  the  index 
selection  problem.  One  is  that  it  is  more  profitable  to  provide  an  index  on  an 
attribute  with  non-uniform  distribution  of  attribute  values  than  on  ones  with  uni¬ 
form  distributions.  (Other  factors  affect  the  choice  also.)  This  does  not  seem  to 
agree  with  intuition.  It  would  seem  more  appropriate  to  index  on  attributes 
that  have  a  uniform  distribution  of  attribute  values  in  order  to  avoid  the  fact 
that  in  some  queries  (these  asking  for  attribute  values  involving  a  large  number 
of  records)  little  information  is  provided  from  the  use  of  the  index.  However, 
the  information  provided  by  the  use  of  an  index  on  an  attribute  depends  not  only 
on  the  distribution  of  attribute  values  in  the  file,  but  also  on  the  distribution  of 
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attribute  values  in  the  user  queries. 

Consider  conjunctive  queries  on  two  attributes  A  and  B  of  a  file.  The  infor¬ 
mation  provided  by  an  index  on  A  depends  on  the  attribute  values  of  A  appearing 
in  the  user  queries.  If  most  queries  ask  for  attribute  values  of  A  involving  a  large 
number  of  records,  the  information  provided  by  an  index  on  A  is  not  large.  If 
most  queries  on  A  ask  for  attribute  values  involving  a  small  number  of  records, 
the  information  provided  by  an  index  on  A  is  large.  For  queries  uniformly  distri¬ 
buted  over  the  attribute  values  of  A  the  amount  of  information  provided  is 
independent  of  the  distribution  of  the  attribute  values  of  A  in  the  file  (because 
for  queries  uniformly  distributed  over  the  attribute  values,  the  expected 
number  of  records  selected  is  the  same  for  any  distribution  of  attribute  values 
in  the  file).  This  is  also  true  in  the  data  base  design  phase  when  no  statistics 
exist  on  the  attribute  values  participating  in  queries.  (In  this  case,  it  is  usually 
assumed  that  queries  are  uniformly  distributed  over  the  attribute  values.) 

However,  although  the  amount  of  information  provided  from  uniform  or 
non-uniform  distribution  of  attribute  values  is  the  same  for  unifoi  in  queries,  the 
average  cost  of  retrieving  the  qualifying  records  is  not.  We  have  shown  that  the 
expected  cost  is  less  for  non-uniform  distribution  of  attribute  values  in  the  file. 
Thus  attributes  with  non-uniformly  distributed  attribute  values  may  be  more 
profitable  for  indexing  than  attributes  with  uniformly  distributed  attribute 
values.  (The  number  of  distinct  values  is  also  a  factor.) 

For  similar  reasons,  a  second  consequence  of  these  results  is  that  correla¬ 
tions  or  dependencies  of  an  attribute  with  other  attributes  already  selected  for 
indexing  make  it  a  better  candidate  for  indexing  (when  queries  are  uniformly 
spread  over  all  the  combinations  of  the  attribute  values).  Finally,  in  clustered 
files,  it  is  more  profitable  to  select  for  indexing  attributes  that  have  attribute 
values  correlated  (or  value  dependent)  with  the  clustering  attribute,  since  the 
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records  qualifying  in  queries  on  these  attributes  are  not  randomly  placed  over 
all  the  blocks  of  the  file.  Thus  an  optimum  indexing  set  in  non-clustered  files 
may  not  be  optimum  in  a  clustered  file  environment.  We  will  examine  these 
issues  in  more  detail  in  Chapter  8. 

In  order  to  be  able  to  provide  better  estimates  of  the  data  base  cost,  more 
detailed  modelling  of  data  base  contents,  data  placement  on  devices  and  user 
requests  is  necessary.  We  will  propose  methods  for  more  detailed  modelling  in 
subsequent  chapters. 
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Chapter  6:  Estimation  of  Record  Selectivities 


6.1  Introduction 

In  this  chapter,  we  examine  methods  for  providing  better  approximations  of 
record  selectivities  in  flat  files.  We  regard  the  records  of  a  flat  file  as  points  of 
an  n-dimensional  space,  where  n  is  the  number  of  attributes  of  the  file.  Each 
attribute  corresponds  to  an  axis  of  the  n-dimensional  space.  Attribute  values  of 
quantitative  attributes  (like  SALARY,  AGE,  and  HEIGPIT)  correspond  directly  to 
points  of  the  axis.  Attribute  values  of  categoric  attributes  (like 
RESPONSIBILITY-LEVEL,  SEX,  and  DEGREE)  are  mapped  into  discrete  values  in 
the  corresponding  axis.  The  attribute  values  of  a  record  of  the  file  determine  a 
point  in  this  space.  We  hereafter  refer  to  this  space  as  the  attribute  space. 

Typical  data  bases  contain  tens  or  hundreds  of  attributes,  and  some  attri¬ 
bute  domains  contain  a  large  number  of  attribute  values.  The  record  selectivity 
of  each  query  would  be  completely  specified  if  we  kept  information  on  the 
number  of  records  for  every  combination  of  attribute  values  in  the  attribute 
space  of  the  file.  However,  to  keep  this  information,  the  amount  of  storage 
required  is  exponential  in  the  number  of  attributes  {v^,  where  v  is  the  number 
of  distinct  values  for  each  attribute  and  a  is  the  number  of  attributes).  Since 
this  is  very  expensive  for  a  realistic  environment,  techniques  that  approximate 
the  record  selectivity  become  important. 
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The  approach  described  in  this  chapter  is  based  on  the  approximation  of 
the  density  distribution  of  points  in  the  attribute  space  using  a  multivariate  den¬ 
sity  distribution  function.  This  density  function  is  then  used  to  derive  estimates 
of  record  selectivities. 

We  first  present  a  parametric  approach  which  is  suitable  for  data  bases 
describing  populations  characterized  by  unimodal  distributions  of  attribute 
values  and  correlated  attributes.  Data  bases  containing  information  about  popu¬ 
lations  are  numerous  and  they  often  involve  a  large  number  of  records.  We  show 
the  estimation  of  various  selectivities  and  we  present  experimental  results  that 
indicate  improvements  in  the  accuracy  of  estimation.  We  then  present  a 
different  approach  which  is  appropriate  for  data  bases  that  have  many  attri¬ 
butes  with  non-continuous,  non-numeric  attribute  values. 


6.2  A  Parametric  Approach 

Traditionally  parametric  and  non-parametric  techniques  have  been  used  for 
describing  the  distribution  of  points  in  an  n-dimensional  space  ([Andrews  1972], 
[Duta  and  Hart  1973],  [Tou  and  Gonzalez  1974]).  Parametric  techniques  assume 
that  the  density  has  a  known  distribution  form  and  use  the  available  samples  to 
estimate  the  parameters  of  the  distribution.  Non-parametric  techniques  do  not 
make  an  a  priori  assumption  about  the  density  distribution  of  the  data,  but  use 
the  available  samples  to  determine  the  shape  of  the  distribution. 

In  this  section  we  describe  a  parametric  technique  for  estimating  record 
selectivities.  Although  non-parametric  techniques  could  prove  more  accurate  if 
a  large  number  of  parameters  is  kept,  our  objective  in  this  section  is  to  estimate 
record  selectivities  by  only  keeping  a  small  number  of  parameters.  We  refer  to 
this  model  of  record  selectivities  as  the  co7itivM.orLS  model  because  it  uses  a  con¬ 
tinuous  density  function  to  describe  the  density  distribution  in  the  attribute 
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space.  We  describe  the  distribution  of  records  in  the  attribute  space  using  a 
member  of  a  family  of  diverse  distributions.  An  important  parameter  of  the 
family  of  distributions  is  the  covariance  matrix  of  the  variables.  The  covariance 
matrix  is  a  s3'-mmetric,  positive  definite  matrix,  and  is  defined  as 
C  =  A’[(x-m)(x  -  m)'],  v.'^here  m  is  the  mean  vector  of  the  distribution,  prime 
denotes  transposition,  and  E  denotes  expectation.  For  example,  in  the  two- 
dimensional  case  (n  =  2),  the  covariance  matrix  is: 

/OlgClO-g  0-2 

where  cr^  and  erg  are  the  standard  deviations  of  the  two  variables,  and  pjg  is  their 
correlation.  Thus,  the  continuous  model  of  data  base  contents  takes  into 
account  the  correlation  of  attribute  values. 

The  family  of  distributions  includes  the  multivariate  Normal  distribution 
and  the  multivariate  extensions  of  the  Pearson  Type  2  and  Type  7  distributions 
[Elderton  1953]. 


Figure  6.2.1 

A  family  of  univariate  distributions 


In  one  dimension,  the  above  family  includes  a  wide  range  of  probability  density 
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functions.  The  various  probability  functions  involved  are  shown  in  figure  6.2.1 
normalized  to  have  the  same  peak  value.  The  univariate  Pearson  Type  2  involves 
symmetric  distributions  ranging  from  the  uniform  distribution  for  k=Q  to  para¬ 
bolic  for  Ar  =  l  and  approaching  the  Normal  for  large  k.  This  distribution  is  non¬ 
zero  only  over  a  finite  region  and,  thus,  well  suited  for  describing  the  distribu¬ 
tion  of  attribute  values  in  a  data  base  environment.  In  one  dimension,  the  Type 
7  Pearson  distributions  include  the  Normal  for  high  values  of  k,  the  t  distribution 
for  half  integer  values  of  k  and  an  appropriate  scale  parameter,  and  the  Cauchy 
distribution  for  fc  =  l. 

The  multivariate  Normal  distribution  is  important  and  is  central  to  the  fam¬ 
ily  of  our  distributions.  The  probability  density  function  of  the  multivariate  Nor¬ 
mal  distribution  is  given  by; 

^  -  (l/3)(x-  m)'C-Hx-  m)] 

where  C  is  the  covariance  matrix  of  the  variables  and  m  is  the  mean  vector.  The 
points  of  constant  density  are  hyperellipsoids  for  which  =  (x-  m)'C“^(x— m)  is 
constant,  is  usually  called  the  Mahalanobis  distance  from  x  to  m.  Figure 
6.2.2  shows  curves  of  constant  density  for  the  Normal  distribution  in  a  two- 
dimensional  space. 

The  univariate  Pearson  Type  2  distribution  is 

with \x  -Tn\  ^  — 

CJ 

where  B  is  the  Beta  function,  and  cj  is  a  scaling  constant.  A  multivariate  exten¬ 
sion  of  this  distribution  is 

|a(x)  in  region  T 
p(")  “  elsewhere 

where  T  is  the  interior  of  the  hyperellipsoid  (x-m)'W(x-m)  =  1,  and 
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Attribute  2 
values 


Figure  6.2.2 
Normal  density  curves 


where  F  is  the  Gamma  function.  The  scaling  matrix  W  is  given  by 


W  = 


Zk  +71  +  2 


1-1 


and  A;  ^  0  is  a  parameter. 


The  univariate  Pearson  Type  7  distribution  is 


p(x)  = 


CJ 


B(iy2,k  -  1/2) 

A  multivariate  extension  of  this  distribution  is 


[1  +  cu^(x 


p(x)  = 


mi 


-  n/2) 

with  2k  >71 


W|  +  (x-  m.)'yt{x  -  m.)]“* 


For  21c  >71  +  q.  the  g'th  moment  exists  [Cooper  1964].  In  particular,  if  2k  >7^  +  2 

1 


the  covariance  matrix  exists  and  W  = 


-  71  -  2 


c-1 


In  addition  to  the  variety  of  distributions  which  it  includes,  this  family  of 
distributions  has  some  other  desirable  properties  for  our  purpose.  Since 
queries  on  the  data  base  refer  to  only  a  subset  of  all  the  attributes,  it  is 
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important  that  the  parameters  of  the  distribution  followed  in  the  subspace 
specified  by  the  attributes  appearing  in  the  query  be  determined  efficiently. 
The  family  of  distributions  described  has  this  property.  The  marginal  distribu¬ 
tion  of  the  multivariate  Normal  is  also  Normal,  and  the  marginals  of  Type  2  and 
Type  7  Pearson  distributions  remain  of  the  same  T57-pe.  In  going  from  a  space  of 
dimensionality  n  to  a  space  of  dimensionality  m  <  n,  the  portion  of  the  original  n 
dimensional  covariance  matrix  corresponding  to  the  m -subspace  is  preserved. 
The  same  is  true  for  the  inverse  scaling  matrix,  i.e.,  The  parameter  k'  of 

the  distribution  becomes  in  the  m-subspace  k,  where 

k  —  k'  +  {tl  —  m)/2  for  Type  2,  and 
k  -  k'  —  {n  -m)/2  for  Type  7 

Thus  if  queries  refer  to  only  a  subset  of  the  attributes,  the  probability  distribu¬ 
tion  followed  by  the  attribute  values  in  this  subspace  can  be  easily  found  without 
the  need  of  integration. 

A  second  important  property  of  this  family  of  distributions  is  that  the  esti¬ 
mation  of  parameters  and  the  selection  of  a  member  of  the  family  can  be  done 
in  one  pass  of  the  available  data.  During  this  pass,  the  mean  values  and  the 
fourth  moments  for  each  attribute,  as  well  as  the  covariance  matrix,  are 
estimated.  Details  of  the  parameter  estimation  are  given  in  Appendix  1. 

A  third  important  property  is  that  the  parameters  of  the  distributions  are 
adaptive.  Adaptivity  of  parameters  allows  us  to  periodieally  update  the  parame¬ 
ter  values  to  reflect  the  values  of  the  new  records  inserted  in  the  data  base 
without  scanning  all  of  the  data  base  again.  Formulae  for  the  adaptive  update 
for  this  family  of  distributions  are  also  given  in  Appendix  1. 

Finally,  the  parameters  of  the  distributions  have  an  intuitive  significance 
since  they  refer  to  quantities  which  arc  well  understood  and  well  described  in 
the  literature,  such  as  means,  standard  delations,  correlations,  and  kurtosis. 
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Intuitive  significance  of  parameters  allows  the  parameters  to  be  approximated 
in  the  absence  of  actual  data  (data  base  design  phase),  as  well  as  to  draw  useful 
inferences  for  the  data  base  design  based  on  the  values  of  these  parameters. 

Next,  we  describe  the  estimation  of  record  selectivities.  Given  a  conjunc¬ 
tive  query  specifying  any  number  of  attributes  of  a  file,  we  would  like  to  estimate 
the  expected  number  of  records  qualifying  for  the  query.  Formally,  a  conjunc¬ 
tive  query  Q  has  the  form 


Q  =  (g,  X,  Ax) 

where  q  is  the  m-dimensional  attribute  subspace  specified  by  the  attributes  par¬ 
ticipating  in  the  query  and  x  -  and  Ax  =  (Ax^^,  Ax^^...,  Ax^^)'  are 

vectors  in  q.  The  interpretation  is  that  the  conjunctive  query  Q  reqiiests  all  the 
records  of  the  file  which  have  attribute  values  y  =  ,  •  •  •  ,yi  )'  in  the  sub- 

set  q  of  their  attributes  such  that  +  Ax^^  for  y  =  1,  •  •  •  ,m.  This  formu¬ 

lation  allows  asking  for  ranges  of  attribute  values  as  well  as  for  equality  of  attri¬ 
bute  values  for  categoric  attributes.  (In  this  case,  Ax^  will  be  the  difference  of 
two  consecutive  discrete  values  in  the  axis  of  the  attribute.)  The  expected 
number  of  records  qualifying  for  the  query  Q  is  therefore  given  by 

X  -1-^ 


=  jV 


r 

J 


1  h 


E(Q)  =  H  f  p(j)dy 


(■■■if 


Z-  +AZ'  * 

'm  V. 


piyi,‘  ■  ■  ■ 


where  N  is  the  number  of  records  in  the  file.  For  small  values  of  Axi^,  ...  ,  Ax^ 


the  above  quantity  can  be  approximated  by 


EiQ)  =  Np{k  +  -^)Axi^Axi2  •  •  •  Axi^ 

where  7?(x)  is  the  value  of  the  probability  density  function  at  the  point  x  of  the 
space.  This  formula  can  be  used  for  the  estimation  of  record  selectmties  in 
conjunctive  queries. 


The  estimation  of  the  record  selectivity  of  a  disjunctive  query 


-  67  - 


Q  =  Qi'^jQq  ■  ■  ■  [jQ^  ,  where  is  a  subquery  on  attribute  can  be  approxi¬ 
mated  by  —  y^T'Hj  where  rii  is  the  record  selectivity  of  the  query  Qi  and  riij  is 

i  ij 

the  record  selectivity  of  the  conjunctive  query  Qi  n  Qj.  The  selectivity  of  disjunc¬ 
tive  normal  form  queries  ean  be  estimated  in  a  similar  way,  where  Qi  now  is  a 
conjunction.  Conjunctive  normal  form  queries  can  be  transformed  into  disjunc¬ 
tive  normal  form,  and  thus  their  selectivities  can  be  estimated  in  a  similar  way. 

As  an  example,  consider  the  application  of  the  continuous  model  for  the 
estimation  of  record  selectivities  in  the  engineering  data  base.  A  selection  pro¬ 
gram  used  the  data  in  the  engineering  report  for  selecting  a  member  of  the  fam¬ 
ily  of  distributions  to  approximate  the  distribution  of  points  in  the  attribute 
space.  A  prediction  program  was  then  used  to  estimate  the  record  selectivity 
for  a  particular  query  using  the  selected  member  of  the  family  of  distributions. 
Using  these  programs,  the  number  of  engineers  in  responsibility  level  "F"  is 
estimated  to  be  1128  compared  to  the  actual  1036,  a  relative  error  of  .04.  In 
comparison,  the  relative  error  using  the  uniformity  and  independence  assump¬ 
tions  was  .37  (as  was  showm  in  the  examples  of  chapter  5).  The  estimated 
number  of  engineers  in  responsibility  level  "A"  and  more  than  15  years  of  experi¬ 
ence  is  zero,  compared  to  the  actual  6.  The  relative  error  is  .05  instead  of  .98 
under  the  uniformity  and  independence  assumption  and  .97  given  by  the 
independence  assumption  only  (when,  we  know  the  exact  number  of  records  per 
value  for  each  attribute).  In  this  example,  the  continuous  model  improved  con¬ 
siderably  the  estimation  of  record  selectivities  compared  to  the  other  models. 

The  accuracy  of  prediction  of  record  selectivities  using  the  continuous 
model  and  the  uniformity  and  independence  assumption  model  was  compared 
for  all  single  attribute  equality  queries  on  the  attributes  RESPONSIBILITY-LEVEL 
and  YEARS-OF-EXPERIENCE,  as  well  as  on  conjunctive  queries  on  these  attri¬ 
butes.  The  selection  and  prediction  programs  described  before,  were  used  for 
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this  purpose.  Equality  queries  on  the  attribute  RESPONSIBILITY-LEVEL  gave  an 
average  normalized  relative  error  of  9%,  while  the  corresponding  error  of  the 
uniformity  assumption  was  21%.  The  percentage  of  queries  for  which  the  con¬ 
tinuous  model  achieved  a  better  prediction  versus  the  prediction  of  the  unifor¬ 
mity  assumption  was  100%.  Range  queries  on  the  attribute  YEARS-OF- 

EXPERIENCE  gave  an  average  relative  error  of  12%,  while  the  corresponding 

error  using  the  uniformity  assumption  was  22%.  The  percentage  of  queries 
where  the  model  achieved  better  prediction  was  66%.  (The  attribute  values  of 
this  attribute  were  nearer  to  uniform  than  the  attribute  values  of 

RESPONSIBILITY -LEVEL).  The  average  normalized  relative  error  of  the  conjunc¬ 
tive  queries  on  the  two  attributes  was  11%  using  the  model  and  27%  using  the 
uniformity  and  independence  assumptions.  The  percentage  of  conjunctive 

queries  where  the  model  achieved  better  approximations  than  the  uniformity 
and  independence  assumption  was  30%. 

The  density  distribution  in  the  attribute  space  of  the  engineering  data  base 
is  approximately  unimodal.  To  compare  the  error  of  the  continuous  model  with 
the  error  of  the  uniformity  and  independence  assumption  model  for  an  arbitrary 
distribution  in  the  attribute  space,  simulation  was  used.  Small  size  files  with  two 
attributes  were  created  by  using  random  permutations  of  the  attribute  values  of 
the  records.  For  each  permutation,  the  correlation  of  the  attribute  values  was 
measured  and  recorded.  Then  the  selection  and  prediction  programs  were  used 
to  calculate  the  average  error  of  the  continuous  model  for  ail  conjunctive 
queries  on  the  two  attributes.  The  average  error  of  all  queries  using  the  unifor¬ 
mity  and  independence  assumption  model  to  estimate  record  selectivities  was 
also  calculated.  The  results  of  the  simulation  are  sho'wm  in  figure  6.2.3.  The  two 
solid  lines  correspond  to  the  average  error  over  all  queries  (and  all  files  created) 
of  the  two  models  when  the  distributions  followed  by  the  attribute  values  of  each 
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attribute  were  uniform.  The  two  dotted  lines  correspond  to  the  average  error  of 
the  two  models  when  the  distributions  followed  were  unimodal.  The  results  show 
that  in  the  uniformity  and  independence  assumption  model  the  error  increases 
with  the  absolute  value  of  the  correlation.  The  average  prediction  error  is  large 
even  when  the  distribution  in  each  axis  is  uniform.  This  is  due  to  the  fact  that 
high  correlations  imply  non-uniform  distributions  in  the  two-dimensional  attri¬ 
bute  space.  On  the  other  hand,  the  eiior  of  the  continuous  model  decreases 
with  high  correlations.  This  is  due  to  the  fact  that  high  correlations  imply  that 
most  points  are  near  the  correlation  line,  and  attribute  value  combinations 
involving  no  records  of  the  file  were  near  each  other,  far  from  the  correlation 
line.  This  distribution  can  be  better  approximated  by  a  unimodal  famiily  of  dis¬ 
tributions.  The  above  observation  suggests  that  data  correlations  may  be 
exploited  to  improve  estimation  accuracy. 


Average 

error 


relative 


Figure  6.2.3 

Average  error  versus  correlation 
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Correlations  versus  Dependencies 

We  examine  next  the  relationship  between  dependencies  and  correlations, 
and  show  that  correlations  can  increase  by  rearranging  the  values  of  one  of  the 
attributes.  This  relationship  has  implications  for  the  accuracy  of  the  approxi¬ 
mation  using  this  family  of  distributions  because,  as  the  results  of  the  simula¬ 
tion  show,  the  average  error  of  the  model  decreases  for  highly  correlated  attri¬ 
butes.  Since  functional  depcndcn  GIGS  G.T'G  special  cases  of  the  dependencies  that 
we  discuss  here,  this  presentation  relates  ideas  used  in  data  base  theory  with 
quantitative  measures  useful  in  data  base  performance  evaluation. 

Consider  two  attributes  X  and  Y  of  a  file.  We  will  assume  that  Y  is  an 
"important"  attribute.  An  important  attribute  can  be  a  joining  attribute,  a  clus¬ 
tering  attribute,  or  an  attribuLe  that  is  often  used  in  conjunctions.  The  reason 
for  distingijishing  an  important  attribute  is  that  we  will  keep  its  values  fixed  and 
rearrange  the  values  of  the  other  attributes  such  that  the  attribute  values  of  the 
two  attributes  present  maximum  correlations.  (If  the  values  of  the  important 
attribute  could  be  rearranged  also,  the  optimum  rearrangement  would  still 
depend  on  the  second  attributes.)  We  see  the  records  of  the  file  separated  into 
categories  corresponding  to  the  distinct  values  of  the  attribute  X.  The  correla¬ 
tion  coefficient  is  estimated  as: 


(6.2.1) 


r 


were  is  the  mean,  is  the  standard  deviation  in  the  X  axis  (both  depend  on 
the  assignment  of  values  in  this  axis),  is  the  conditional  mean  on  the  Y  axis 

for  the  records  in  the  category  i,  and  yj^  and  are  the  unconditional  means 
and  standard  deviations  in  the  Y  axis.  We  will  minimize  this  expression,  subject 


to  the  constraint  V 


i 


Nx^ 


—  1,  using  the  method  of  Lagrange  multipliers. 
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The  Lagrangian  of  the  problem  is 


r  1  ^  v-t  '^i^i  ~  ®m) 


Differentiating  we  obtain 


dL 


9( 


Xi~x^  .  N  ^  ys 


for  all  the  categories  i.  The  solution  of  the  system  of  equations  is 


2X  Vs 


—  X 

Substituting  In  the  constraint  Yl~17 — ~ — ^ ^  obtain 

i  a^s 

TLi  ^y^i  ~  y^  ^ 

j:-^n — —  =  1 


4X2 


y* 


or 


Thus  an  optimum  is  obtained  for 


Xi  —  X. 


m 


y'n%^  Vm 


7Li  Vm,  Vm 
^  y,  ^ 


Vs 


(6.2.2) 


The  denominator  of  the  second  part  of  this  equation  is  independent  of  the 
category  i.  We  can  represent  it  with  a  constant  The  solution  becomes: 

rC 


Xi  -  Xr. 


~  ^  (Vm^  Vm) 


The  Hessian  matrix  of  the  problem  is  negative  definite  since 


d^L 


a(fL__^)2 


X. 


XTLj 

=  -8^  <  0 


for  ail  i.  Thus  the  above  solution  maximizes  the  correlations.  V 


However,  we  want  to  assign  to  each  category  not  an  arbitrary  value,  but 
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consecutive  integers.  The  problem  with  this  constraint  becomes  considerably 
more  difficult.  A  heuristic  solution  motivated  by  the  solution  of  the  continuous 
problem  is  to  find  the  conditional  means  on  the  Y  axis  for  each  category  i, 
form  the  differences  for  all  i,  order  them,  and  assign  values  to  each 

category  i  proportional  to  the  order  of  the  corresponding  -  y^.  As  an  exam¬ 
ple,  if  we  want  to  obtain  high  correlations  between  the  attributes 
RESPONSIBILITY-LEVEL  and  YEARS-OF-EXPERIENCE  (assuming  that  YEARS-OF- 
EXPERIENCE  is  an  important  attribute),  we  can  find  the  mean  number  of  years 
of  experience  for  each  responsibility  level  (in  one  pass  of  the  data),  order  them, 
and  assign  values  to  each  responsibility  level  according  to  their  order.  This  pro¬ 
cedure  will  give  us  the  obvious  ordering  since  high  responsibility  levels 
correspond  to  many  years  of  experience.  However,  this  may  not  be  obvious  for 
other  attributes.  From  a  data  base  theory  point  of  view,  the  above  presentation 
implies  that  functional  dependencies  may  (with  appropriate  mapping  of  attri¬ 
bute  values)  result  in  high  correlations.  However,  the  maximum  correlation 
becomes  one  if  and  only  if  the  relationship  is  one  to  one  as  can  be  seen  from  the 
relations  (6.2.1)  and  (6.2.2).  Moreover,  two  attributes  may  present  very  high 
correlations  without  being  functionally  dependent  on  each  other.  From  a  per¬ 
formance  point  of  view,  this  implies  that  the  accuracy  of  this  model  can  increase 
in  the  presence  of  functional  dependencies  (according  to  figure  6.2.3),  while  the 
accuracy  of  the  uniformity  and  independence  assumption  model  decreases. 

Normalization  has  an  important  implication  for  the  correlations  of  the  attri¬ 
bute  values  of  the  records  of  a  file.  It  tends  to  create  files  with  attributes  that 
have  attribute  values  with  correlations  equal  to  the  correlations  of  these  attri¬ 
butes  in  the  real  world.  Consider  a  file  with  attributes  (E#,  AGE,  SALARY,  DEPT^) 
such  that  E#  functionally  determines  AGE,  SALARY,  and  DEPT^.  Since  for  each 
employee  there  is  only  one  record  in  the  file,  the  correlation  between  the  attri- 
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butes  AGE  and  SALARY  in  the  file  will  be  the  same  as  the  correlation  of  these  two 
attributes  in  the  real  world.  Consider  now  the  file  (E#,  AGE,  SALARY,  DEPT^ijif, 
CITY-OF-DEPT),  where  the  E#  does  not  determine  functionally  the  CITY-OF-DEPT. 
In  this  file,  the  relationship  between  AGE  and  SALARY  for  a  particular  employee 
appears  in  the  data  base  as  many  times  as  there  are  cities  for  his  department. 
Thus,  various  (AGE,  SALARY)  combinations  in  the  file  acquire  various  weights.  As 
a  result,  the  correlation  between  the  attributes  AGE  and  SALARY  is  not  the  same 
in  Lhis  non-normalized  file,  as  the  one  in  the  real  world.  One  implication  of  this 
result  is  that  samples  from  the  real  world  can  be  used  in  the  data  base  design 
stage  to  predict  the  correlations  of  the  attributes  in  the  file. 

We  next  discuss  the  application  of  the  continuous  model  in  arbitrary  files 
with  distributions  that  are  not  necessarily  unimodal.  The  continuous  model  uses 
unimodal  multivariate  density  functions  to  approximate  the  density  distribution 
in  the  attribute  space.  This  suggests  that  it  will  approximate  best  densities  that 
are  unimodal.  Unimodality  in  each  axis  may  not  always  result  in  high  correla¬ 
tions.  As  an  example  of  where  it  is  possible  to  obtain  both  unimodality  and  high 
correlations  by  rearranging  the  attribute  values,  consider  the  attributes 
CITIZENSHIP,  COUNTRY-OF-HIGHEST-DEGREE,  and  FIRST-LANGUAGE.  We  can 
order  CITIZENSHIP  such  that  it  presents  a  unimodal  distribution.  If  we  maintain 
a  similar  mapping  in  the  other  two  attributes,  then  we  will  have  both  unimodality 
and  high  correlations.  In  the  case  where  this  is  not  possible,  unimodality  may 
be  prefered  since  more  queries  may  be  single  attribute  queries  and  also  the 
number  of  records  involved  in  single  attribute  queries  is  usually  larger. 

We  used  the  continuous  model  to  model  an  inventory  file  of  1080  records 
with  11  attributes.  Table  6.2.1  shows  the  distributions  chosen  by  the  selection 
program  to  approximate  the  distribution  of  the  attribute  values  of  the  attri¬ 
butes.  (Type  is  the  type  of  the  distribution  and  K  is  the  parameter  k  determin- 
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Tab 

e  6.2.1 

Att# 

Type 

K 

Aveu 

Avem 

Prwin 

1 

2 

5.0 

.73 

.03 

1.00 

2 

'V 

I 

6.0 

.70 

.04 

.95 

3 

7 

4.8 

.58 

.06 

.88 

4 

2 

0.0 

.52 

.15 

.93 

5 

7 

3.8 

.60 

.15 

.92 

6 

7 

3.7 

.65 

.13 

.90 

7 

7 

3.2 

.55 

.20 

.81 

8 

7 

3.3 

.55 

.16 

.85 

9 

2 

28.3 

.63 

.06 

.95 

10 

2 

4.1 

.43 

.03 

1.00 

11 

7 

9.5 

.58 

.10 

.90 

Av. 

.63 

.11 

.91 

ing  the  specific  member  of  the  family  selected).  We  have  assigned  numeric 
values  to  all  the  categoric  attributes  so  that  the  distribution  of  attribute  values 
in  each  axis  is  approximately  symmetric  unimodal.  This  implies  two  file  passes 
for  selecting  the  required  information.  The  majority  of  the  distributions  shown 
have  high  peaks  as  indicated  by  the  T5'’pe  and  the  parameter  k  of  the  distribu¬ 
tion  (Type  2  distributions  with  k  values  near  zero  approximate  the  uniform  dis¬ 
tribution).  This  indicates  again  that  the  uniformity  assumption  is  unrealistic  in 
real  data  base  environments.  The  table  shows  the  average  normalized  relative 
error  in  the  estimation  of  selectivities  over  ail  equality  queries  on  a  given  attri¬ 
bute  (Aveu)  using  the  uniformity  assumption.  (For  non-discrete  value  attri¬ 
butes,  the  range  of  attribute  values  has  been  divided  into  subranges  corr  esporid- 
ing  to  queries.)  The  average  normalized  relative  error  using  the  continuous 
model  is  also  shown  (Avem).  The  last  column  in  the  table  (Prwin),  shows  the  pro¬ 
portion  of  the  equality  queries  in  which  the  continuous  model  achieves  a  lower 
error  in  the  estimation  of  selectivities.  A  similar  set  of  performance  measures 
(but  more  specific  to  query  optimizers)  has  been  proposed  in  a  recent  perfor¬ 
mance  evaluation  of  the  System  R  query  optimizer  [Astrahan  et  at.  1980].  A 
comparison  of  the  two  approaches  shows  that  the  continuous  model  consider¬ 
ably  improves  the  estimation  of  selectivities  for  single  attribute  queries. 
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Table  6.2.2 

Att^ 

Aveu 

Avem 

Prwin 

1.8 

.25 

.04 

.94 

2,3 

.37 

.08 

.92 

3,4 

.38 

.07 

.91 

4,5 

.32 

.07 

.91 

5,6 

.43 

.11 

.92 

6,7 

.33 

.09 

.37 

7.8 

.19 

.07 

.87 

8,9 

.20 

.07 

.87 

9,10 

.55 

.12 

.85 

10,11 

.53 

.17 

.82 

Aver 

.35 

.09 

.89 

Table  6.2.2  compares  the  accuracy  of  the  continuous  model  with  the  unifor¬ 
mity  and  independence  assumption  model  in  conjunctive  queries  on  a  subset  of 
the  possible  combinations  of  two  attributes.  In  the  file,  a  significant  proportion 
of  the  combinations  of  attribute  values  of  two  attributes  did  not  involve  any 
records  of  the  file  at  all,  which  makes  uniformity  inappropriate.  Our  results  sug¬ 
gest  that  a  significant  improvment  in  the  estimation  of  selectivities  in  multiattri¬ 
bute  queries  can  be  achieved  by  using  the  continuous  model  instead  of  a  model 
that  uses  the  uniformity  and  independence  assumptions,  even  for  attributes 
that  do  not  present  unimodal  distributions  and  correlated  attribute  values. 
However,  in  this  file,  the  continuous  model  did  not  significantly  improve  over  a 
model  that  used  the  exact  distributions  in  each  axis  and  assumed  independence 
of  attribute  values  to  estimate  the  record  selectivities  of  conjunctive  queries. 
This  is  due  to  the  fact  that  the  correlations  between  attributes  in  this  file  were 
sm  all. 

6.3  Average  Record  Selectivities 

In  this  section,  we  present  a  model  for  the  estimation  of  average  record 
selectivities.  We  separate  the  queries  into  classes  as  in  [Anderson  and  Berra 
1977]  and  [Hammer  and  Chan  1976]  according  to  the  attributes  participating  in 
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a  conjunctive  or  a  disjunctive  query.  We  view  the  queries  of  a  class  as  points  in 
an  771-dimensional  sub-space  of  the  attribute  space,  where  m  is  the  number  of 
attributes  specified  in  this  class  of  queries.  We  call  this  sub-space  the  query 
class  space.  We  use  the  same  family  of  probability  density  functions  to  describe 
the  distribution  of  this  class  of  queries.  We  use  g  (x)  to  denote  the  probability 
density  in  the  query  class  space. 

For  simplicity  we  assume  here  that  the  attribute  values  are  distinct  and 
that  all  the  queries  are  equality  queries  such  that  the  intervals  Axi,  •  •  .  Az^ 
specified  in  queries  are  constant  for  all  the  queries  specifying  a  given  attribute. 

More  general  estimates  however  are  possible.  The  number  of  records  qualifying 
in  a  query  {q,  x,  Ax)  is  given  by  (x)Ar  jAzg  ‘  '  '  Ar^^  as  in  the  previous  section. 
The  expected  number  of  records  qualifying  in  a  query  of  a  class  c  therefore  is 

E{Nc)  =  / g(x)(AXx)Azi  ■  •  •  Ar„,.)dx 
=  NAxi  ■  •  •  !^Xmf  g(x)p(x)dx 

The  expected  number  of  records  qualifying  in  queries  in  any  subset  of  the 
attributes  of  the  class  (included  in  a  candidate  indexing  set  say)  can  be 
estimated  by  considering  only  the  relevant  attributes  in  the  above  formula.  If 
all  the  queries  in  the  data  base  were  single  attribute  queries,  this  approach  of 
estimating  average  record  selectivities  will  produce  results  similar  to  those  of 
Hammer  and  Chan.  However,  in  multiattribute  retrieval  environments,  the  two 
approaches  are  considerably  different. 

The  integral  in  the  formula  can  be  approximated  for  the  different  distribu¬ 
tions  considered.  The  solution  in  the  case  thatp(x)  and  q(x)  are  Normal  distri¬ 
butions  (as  shown  in  Appendix  2)  is  given  by 


E{iNc)  =  N- 


Arcj  •  •  •  Ax^^e 


where  S  is  the  covariance  matrix  in  the  attribute  subspace,  S  is  the  covariance 


matrix  in  the  query  class  space,  C  is  a  positive  definite  matrix,  and  M  is  a 


-  77  - 


constant. 

As  an  example  consider  single  attribute  equality  queries  on  the  attribute 
RESPONSIBILITY-LEVEL  of  the  engineering  data  base.  The  average  record  selec¬ 
tivity  depends  on  which  are  the  more  frequent  queries.  If  most  queries  ask  for 
RESPONSIBILITY-LEVEL  =  "D",  the  average  selectivity  will  be  near  to  the  actual 
record  occurrences  of  RESONSIBILITY-LEVEL  =  "D"  (4345),  while  if  most  queries 
ask  for  RESPONSIBILITY-LEVEL  =  "A”,  the  average  selectivity  will  be  near  to  the 
record  occurrences  of  RESPONSIBILITY-LEVEL  =  "A"  (1355).  If  the  queries  were 
uniformly  spread  over  all  the  possible  values,  the  average  selectivity  would  be 
equal  to  the  mean  value  of  record  occurrences  per  responsibility  level  (2416). 
These  three  estimates  are  considerably  different.  Figure  6.3.1  shows  the  aver¬ 
age  record  selectivity,  as  estimated  by  the  continuous  model,  as  a  function  of 
the  standard  deviation  of  the  RESPONSIBILITY- LEVEL  values  used  in  queries.  (In 
this  experiment  we  assumed  that  the  means  of  the  distributions  in  the  attribute 
space  and  the  query  space  coincided.) 


Average  record 
selectivity 


values  in  queries 


Figure  6.3.1 

Average  selectivity  in  single  attribute  queries 
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In  multiattribute  queries,  the  average  record  selectivity  depends  on  the 
correlations  of  attribute  values  in  the  data  base,  as  well  as  on  the  correlation  of 
the  attribute  values  used  in  the  queries  on  the  data  base.  Figure  6.3.2  shows  the 
average  record  selectivity  as  a  function  of  the  correlations  of  the  attribute 
values  in  the  file  and  in  the  queries,  as  estimated  by  the  continuous  model.  (In 
this  experiment  the  means  of  the  distributions  in  the  attribute  space  and  the 
query  space  did  not  coincide.) 
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Figure  6.3.2 

Average  record  selectivity  in  conjunctive  queries 


6.4  Approximating  the  Multidimensional  Histogram 

Non-parametric  techniques  have  been  used  to  provide  better  approxima¬ 
tions  of  the  density  function  in  a  multidimensional  space  with  the  expense  of 
more  storage  for  storing  the  larger  number  of  parameters  required.  Two  impor¬ 
tant  non-parametric  approaches  are  the  'potential  funtion  approach  and  the 
clustering  approach  ([Tou  and  Gonzalez  1974],  [Duta  and  Hart  1973]).  Each  of 
them  has  many  var  iations. 
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The  potential  function  approach  assigns  to  each  sample  a  potential  (a  func¬ 
tion  that  has  a  high  peak  at  the  sample  point  and  decreases  rapidly  at  any  point 
away  from  the  sample).  The  cumulative  potential  describes  the  density  distribu¬ 
tion  in  the  multidimensional  space  .  There  are  two  types  of  potential  functions. 
Type  1  potential  functions  approximate  the  density  as  a  sum  of  exponentials  and 
provide  very  good  approximations  of  the  multidimensional  density.  However, 
they  have  to  store  all  the  available  samples,  and  use  them  in  the  evaluation  of 
the  density  at  one  point.  Thus,  they  are  extremely  expensive  and  impractical 
for  our  purpose.  Type  2  potential  functions  provide  polynomial  approximations 
of  the  density  [Specht  1967].  However,  as  Meisel  observes  [Meisel  1972],  func¬ 
tional  approximations  may  need  many  terms  in  order  to  provide  good  approxi¬ 
mations  of  the  probability  density. 

The  clustering  approach  detects  clusters  of  points  in  an  n  dimensional 
space.  After  the  clusters  have  been  identified  and  the  points  in  the  space  have 
been  assigned  a  cluster  membership,  the  probability  density  for  each  cluster 
can  be  represented  as  the  sum  of  densities  [Agrawala  and  Mohr  1978].  Cluster¬ 
ing  approaches  usually  require  many  iterations  through  the  data  points,  as  well 
as  some  data  preparation  (scaling)  in  order  to  separate  points  into  clusters. 

Important  characteristics  of  the  above  approaches  that  discourage  their 
use  in  a  general  data  base  environment  are  the  assumption  that  the  density  is  a 
continuous  function,  the  emphasis  on  all  the  dimensions  at  the  same  time,  and 
the  assumption  of  strictly  numeric  attributes.  In  most  data  base  environments, 
we  often  encounter  discrete  distributions,  queries  refer  to  a  small  subset  of  the 
attributes  only,  and  certain  attributes  or  subsets  of  attributes  appear  more 
often  in  queries  than  others  ([Magalhaes  1980],  [Hammer  and  Chan  1976]). 
Moreover,  it  is  often  the  case  that  attributes  have  many  distinct,  non-numeric 
attribute  values.  When  this  is  the  case,  a  mapping  of  the  non-numeric  attribute 
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values  into  reed  numbers  is  required.  This  mapping  can  be  very  expensive  in 
space  (often  more  expensive  than  the  cost  of  storage  of  the  parameters  of  the 
distribution)  and  can  make  these  techniques  impractical. 

We  present  here  a  different  approach,  more  appropriate  for  a  general 
environment.  We  call  this  model  the  discrete  model.  This  approach  reduces  the 
storage  requirements  for  approximating  the  overall  density.  It  does  so  by  keep¬ 
ing  more  informative  statistics,  and  by  reducing  the  mapping  requirements. 
The  information  that  is  provided  by  a  set  of  statistical  measurements  depends 
on  how  often  these  measurements  are  used  in  predicting  the  actual  density.  If 
they  are  not  often  used,  they  are  of  less  value.  For  example,  if  detailed  statis¬ 
tics  are  kept  about  an  attribute  that  does  not  often  participate  in  queries,  the 
amount  of  information  revealed  by  the  statistics  about  the  data  base  contents  is 
not  large.  This  motivates  keeping  more  statistics  about  attributes  or  subsets  of 
attributes  that  are  often  utilized  in  queries.  (Note  that  this  is  not  possible  with 
parametric,  and  many  non-parametric,  techniques).  To  avoid  excessive  mapping 
we  should  somehow  abstract  the  true  density  in  each  attribute.  A  density  func¬ 
tion  in  one  dimension  can  be  approximated  by  a  histogram  with  unequal  inter¬ 
vals  as  shown  in  figure  6.4.1.  The  advantage  with  an  abstraction  like  this  is  that 
now  the  mapping  is  required  only  at  the  points  where  the  histogram  height 
changes.  If  lexicographic  ordering  is  used,  comparison  mth  the  stored  mapping 
points  will  show  in  which  interval  a  specific  value  belongs.  The  larger  the 
number  of  intervals,  the  better  this  histogram  approximates  the  true  density. 
Thus,  we  can  take  into  account  the  fact  that  certain  attributes  are  used  more 
often  in  queries  by  allowing  more  intervals  in  their  histogram. 

To  provide  an  algorithm  for  constructing  such  a  histogram,  we  need  to  use  a 
criterion  to  determine  which  values  are  clustered  together  in  one  interval.  We 
will  assume  in  the  following  that  all  the  attributes  have  discrete  attribute  values. 
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Frequency- 


Figure  6.4.1 

Approximating  one-dimensional  densities 

(Attributes  with  continuous  attribute  values  can  be  considered  as  discrete  if  the 
continuous  interval  is  divided  into  a  large  number  of  discrete  sections.)  Letp^  be 
the  proportion  of  the  records  in  the  file  with  attribute  value  Ai  for  attribute  A. 
V»"e  will  use  a  maximum  difference  criterion,  i.e.,  the  attribute  A  values  with  pro¬ 
portion  Pi  are  clustered  together  if  max(pi)  —  min(pi)  £  dj^,  where  d]^  is  a  con- 

i  i 

slant  depending  on  the  attribute  K.  There  are  some  reasons  for  choosing  a  max¬ 
imum  difference  criterion  (instead  of  an  average  square  error  say).  The  first  is 
that  it  is  easy  to  implement.  A  second  is  that  queries  on  an  attribute  may  be 
concentrated  on  specific  attribute  values  instead  of  being  uniformly  spread  over 
all  attribute  values.  Thus,  it  may  be  more  important  to  keep  the  error  of  each 
value  small  instead  of  optimizing  an  average  error.  A  third  is  that  this  error 
may  be  better  suited  for  attributes  with  non-continuous  attribute  values.  To 
prevent  the  intervals  from  being  very  large  (if  some  section  of  the  distribution  of 
attribute  values  is  near  uniform),  a  maximum  width  can  be  used.  In  fact  the 


maximum  width  determines  a  minimum  number  of  intervals  in  each  axis.  To 
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construct  a  histogram  using  this  method,  it  is  only  necessary  to  keep  the  smal¬ 
lest  and  the  highest  value  encountered  so  far  in  the  current  interval,  and  com¬ 
pare  it  with  the  new  value.  If  the  new  value  makes  the  difference  greater  than 
the  threshold  value  dj^,  or  the  width  of  the  interval  becomes  larger  than  the 
maximum  width  a  new  interval  is  started.  For  each  interval  of  the  histo¬ 
gram,  the  beginning  of  the  interval,  its  height  (average  proportion  of  records 
per  value  in  this  interval)  and  its  width  (number  of  values  in  this  inlerval),  is 
kept. 

When  the  need  for  reduced  mapping  by  approximating  the  histogram  of 
attribute  values  with  intervals  of  constant  density  arises,  no  parametric  or  non- 
parametric  technique  can  be  used  for  the  approximation  of  the  oi'-erail  density 
(with  the  exception  perhaps  of  a  clustering  technique  that  would  form  hyperrec- 
tangular  clusters).  To  estimate  the  probability  densities  for  higher  dimensional¬ 
ities,  the  independence  assumption  could  be  invoked.  Although  the  indepen¬ 
dence  assumption  may  result  in  large  errors  even  when  the  distribution  of  attri¬ 
bute  values  of  each  attribute  is  known,  the  distributions  in  each  axis  impose  cer¬ 
tain  restrictions  on  the  probability  distributions  in  higher  dimensionalities, 
which  may  limit  the  possible  error  of  the  independence  assumption.  For  exam¬ 
ple  consider  a  two-dimensional  space  and  two  intervals  and  Ij  in  the  two  axes 
including  rii  and  rij  records  respectively.  The  number  of  records  TTj,  in  the  rec¬ 
tangle  Rij  of  the  two-dimensional  space,  which  has  projections  the  intervals  Ii 
and  Ij  in  the  two  axis,  is  bounded  by  0  ^  n^j  ^  min(7T.i  ,  rij).  If  one  of  rrj  or  rij  is 
small  then  the  value  of  riij  is  small,  and  the  error  of  estimation  in  the  rectangle 
Rij  will  not  be  large.  This  observation  restricts  the  portions  of  the  subspace  in 
which  large  errors  may  occur.  It  also  imposes  certain  restrictions  which,  if 
exploited,  may  reduce  the  overall  error. 

The  above  suggests  that  to  avoid  large  errors,  for  important  subspaces,  we 
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would  need  to  keep  statistics  on  the  rectangles  with  a  large  number  of  records 
in  them,  according  to  a  threshold  7^  depending  on  the  importance  of  the  sub¬ 
space  k.  However,  these  are  not  the  most  informative  statistics  that  we  can 
keep  for  the  subspace  k.  The  reason  is  that  this  information  may  be  inferred  by 
the  statistics  kept  in  each  axis  (if  independence  holds).  Thus  a  more  informa¬ 
tive  statistic  would  be  one  involving  rectangles  with  the  most  deviation  from 
independence,  according  to  a  threshold  T^..  Let  R^j  denote  rectangles  on  which 
we  keep  additional  information,  as  above.  We  can  use  this  information,  together 
with  the  statistics  in  each  axis,  to  provide  improved  approximations  of  the 
number  of  records  in  the  other  rectangles  of  the  subspace.  The  number  of 
records  in  a  rectangle  with  unknown  statistics  can  be  approximated  by 
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where  n*j  and  riiz  are  the  numbers  of  records  in  the  rectangles  with  known 
statistics  and  coordinate  intervals  in  the  one  axis  or  Jj  in  the  other  axis.  (This 
formula  assumes  two  clusters  in  the  subspace,  one  with  known  statistics,  and 
one  mthout.) 

To  illustrate  the  improvement  over  the  independence  assumption  consider 
an  example  shown  in  figure  6.4.2.  The  figure  shows  a  two-dimensional  space  w-ith 
three  possible  altribuLe  values  for  each  attribute.  The  file  has  6  records.  The 
numbers  in  each  rectangle  are  the  number  of  records  in  the  file  which  qualify  in 
a  conjunctive  query  on  the  two  attributes.  The  projections  in  each  axis  are  uni¬ 
form  distributions  and  since  the  correlations  of  attribute  values  in  this  space 
are  zero,  all  three  models  (uniformity  and  independence,  exact  statistics  in 
each  axis  and  independence,  and  the  parametric  approach  presented  earlier) 
will  give  the  same  estimated  record  selectivity  for  all  equality  queries  (2/3).  The 
maximum  error  of  the  independence  assumption  occurs  in  the  central  rectangle 
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Figure  6.4.2 

Example  of  improvement  over  the  independence  assumption 

(4/3).  Suppose  that  we  keep  additional  statistics  for  the  contents  of  this  rectan¬ 
gle  (nga  =  2).  If  we  do  that,  the  estimation  of  selectivities  using  the  above  for¬ 
mula  derives  the  exact  selectivity  for  all  possible  queries  in  this  two-dimensional 
space.  Thus,  in  this  example,  with  a  small  increase  in  the  storage  overhead  we 
have  a  large  increase  in  the  accuracy  of  estimation  of  selectivities  for  this  sub¬ 
space. 

We  have  used  the  data  of  the  inventory  file  to  compare  the  error  in  the  esti¬ 
mation  of  selecthdties  in  conjunctive  queries  of  a  model  that  uses  the  exact  dis¬ 
tributions  in  each  axis  and  assumes  independence  of  attribute  values  with  the 
error  in  the  estimation  of  selectivities  when  equation  7.4.1  is  used  instead  of  the 
independence  assumption.  All  conjunctive  queries  for  a  subset  of  the  attribute 
pairs  of  the  file  were  examined  and  the  average  normalized  relative  error  for 
each  pair  of  attributes  was  recorded.  Table  6.4.1  summarizes  the  results.  In 
table  6.4.1,  ATT  lists  the  pairs  of  attributes.  IND  is  the  average  error  for  all  con¬ 
junctive  queries  for  a  specific  pair  of  attributes  using  the  independence  assump- 


-  85  - 


tion.  THRIOO  lists  the  results  obtained  when  the  threshold  value  used  for  storing 
additional  information  (difference  of  the  estimated  selectivity  using  the  indepen¬ 
dence  assumption  and  the  actual  selectivity)  was  100.  NEX  is  the  number  of 
values  where  the  threshold  was  exceeded  (and  additional  information  was 
stored).  Finally  TKR50  lists  the  results  obtained  when  the  threshold  value  was 
50  records.  The  results  indicate  that  with  a  small  increase  in  the  amount  of 
information  stored,  a  large  improvement  in  the  estimation  of  selectivities  w^as 
achieved.  This  is  due  to  the  fact  that  the  distributions  in  the  attribute  spaces  of 
some  pairs  of  attributes  presented  high  peaks. 


Table  6.4.1 

ATT 

IND 

THRIOO 

THR50 

RER 

NEX 

RER 

NEX 

RER 

1,2 

.048 

3 

.027 

9 

.023 

2.3 

.085 

2 

.061 

7 

.037 

3.4 

.061 

0 

.061 

0 

.061 

4.5 

.059 

0 

.059 

0 

.059 

5.6 

.113 

3 

.105 

6 

.087 

6,7 

.088 

2 

.055 

3 

.055 

7.8 

.065 

1 

.060 

1 

.060 

8,9 

.070 

1 

.067 

4 

.046 

9,10 

.132 

2 

.123 

6 

.074 

10.11 

.171 

2 

.  144 

6 

.113 

This  approach  of  keeping  some  additional  statistics  in  important  subspaces 
to  improve  the  approximation  of  the  density  function  can  be  extended  to  impor¬ 
tant  subspaces  of  dimensionality  greater  than  tw’^o.  However,  the  profitability  of 
keeping  statistics  in  higher  dimensionalities  decreases  as  the  number  of  dimen¬ 
sions  increa.ses  because  the  number  of  records  in  the  hyperrectangles 
decreases  in  higher  dimensionalities  as  a  result  of  the  restrictions  imposed  by 
the  smaller  dimensionalities  as  mentioned  before. 

The  estimation  of  the  parameters  necessary  for  describing  the  distribution 
requires  more  than  one  pass  of  the  file.  In  the  first  pass,  the  one-dimensional 
statistics  can  be  gathered  and  the  interval  separations  can  be  determined.  The 


-  86  - 


maximum  width  determines  a  minimum  number  of  divisions  in  each  axis.  If  the 
divisions  in  one  axis  are  found  to  be  too  many,  the  threshold  value  can  be 
increased.  The  new  intervals  can  be  determined  from  the  summary  data  without 
the  need  of  a  second  pass.  In  a  second  pass,  the  actual  number  of  records  in  the 
candidate  rectangles  (rectangles  that  involve  a  large  number  of  records  accord¬ 
ing  to  the  inequality  riij  ^  min(7T..i,7ij))  will  be  found,  and  the  ones  with  large 
differences  over  the  estimation  of  independence  will  be  isolated.  A  triplet 
{a,b,p)  will  be  kept  for  each  of  them,  where  a  and  b  are  bit  maps  pointing  to  the 
corresponding  interval  in  each  axis  and  p  is  the  average  proportion  of  records 
for  equality  queries  inside  the  rectangle.  If  statistics  for  more  than  two  dimen¬ 
sions  at  a  time  are  desirable,  more  passes  are  necessary.  All  the  passes  how¬ 
ever,  do  not  have  to  be  done  at  the  same  time. 

In  summary,  the  discrete  model  emphasizes  the  need  to  keep  more  statis¬ 
tics  for  important  files,  attributes,  and  higher  subspaces.  It  avoids  the  excessive 
mapping  problem  by  providing  a  variable  width  histogram.  It  avoids  the  large 
errors  of  the  independence  assumption  by  providing  additional  informative 
statistics  in  important  subspaces.  It  can  approximate  the  density  distribution 
as  closely  as  is  desired  (depending  on  the  importance  of  the  file).  In  the  one 
limiting  case  it  will  give  detailed  statistics  for  every  possible  attribute  value 
combination,  in  the  other  it  will  be  equal  to  the  uniformity  and  independence 
assumption  model.  This  property  of  the  discrete  model  makes  it  also  appropri¬ 
ate  for  answering  queries  in  a  statistical  data  base  environment.  Statistical  data 
bases  typically  contain  a  large  number  of  records.  Moreover,  the  answer  to  a 
statistical  query  need  not  be  exact.  Thus  methods  similar  to  the  ones  described 
in  this  section  can  be  used  for  answering  queries  in  these  environments  without 
retrieving  the  actual  records  of  the  file.  The  statistics  of  the  model  are  easy  to 
derive.  They  may  require  more  than  one  file  pass  (depending  on  the  degree  of 
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accuracy  desired),  but  the  file  passes  are  independent  of  each  other.  (No  inter¬ 
mediate  results  have  to  be  stored.)  The  estimation  of  the  probability  density 
using  these  statistics  is  easy  and  it  does  not  require  the  estimation  of  distance 
functions  or  integration  over  some  dimensions,  as  do  other  non-parametric 
techniques.  Moreover,  the  estimation  of  selectivities  for  inequality  queries  is 
easier,  and  the  conditional  distributions  for  important  subspaces  can  be  easily 
found.  Finally,  the  param_eters  of  this  statistic  are  adaptive.  This  approach  is 
more  desirable  for  a  general  purpose  data  base  environm.ent  with  many  non¬ 
numeric  attributes  and  non-continuous  attribute  values.  However,  in  more  res¬ 
tricted  environments  (for  example  population,  numeric  attributes  and  clear 
clustering)  less  expensive  storage  methods  may  be  possible. 


6.5  Summary 

We  presented  in  this  chapter  a  multivariate  statistical  approach  for  the 
estimation  of  record  selectivities.  We  emphasized  the  need  to  keep  a  small 
number  of  parameters  only  for  approximating  the  overall  density.  For  real  data 
base  environments  which  involve  tens  or  hundreds  of  attributes,  this  approach  is 
mandatory.  We  first  described  a  parametric  approach  (continuous  model)  which 
uses  a  member  of  a  wide  range  probability  density  functions  to  describe  the 
density  in  the  attribute  space  and  showed  that  this  method  cein  be  used  to  pro¬ 
vide  better  approximations  of  the  record  selectivities.  Since  the  family  of  distri¬ 
butions  involves  unimodal  distributions  only,  the  method  is  most  appropriate  for 
data  bases  describing  populations  with  many  attributes  that  have  numeric  attri¬ 
bute  values  highly  correlated. 

For  a  general  data  base  environment  with  many  non-numeric  attributes,  we 
presented  an  alternative  approach  (discrete  model)  that  approximates  the  mul¬ 
tidimensional  histogram  by  keeping  some  particularly  informative  statistics. 
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This  approach  does  not  assume  continuity  of  the  density  distribution,  v/hich  may 
be  unrealistic  in  many  data  base  environments,  and  emphasizes  the  need  to 
keep  more  statistics  for  more  important  attributes.  Moreover,  the  discrete 
model  is  easy  to  implement  and  use.  The  model  can  also  be  used  with  continu¬ 
ous,  highly-correlated  attributes,  but  it  requires  the  storage  of  more  informa¬ 
tion  than  the  continuous  model. 
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Chapter  7:  A  Model  of  Block  Sole etivi ties 


In  this  chapter,  we  present  a  model  for  estimating  block  selectivities.  The 
models  presented  by  Cardenas  and  Yao  ([Cardenas  1975],  [Yao  1977])  are  shown 
to  be  special  cases  of  this  model.  Finallj''  we  present  more  efficient  methods  for 
estimating  block  selectivities  under  alternative  clustering  as  well  as  methods  for 
estimating  average  block  selectivities. 

7.1  Estimation  of  Block  Selectivities 

In  this  section  we  present  a  model  for  estimating  block  selectivities  and 
compare  our  model  with  the  existing  ones.  We  first  extend  the  multivariate  sta¬ 
tistical  model  presented  in  the  previous  chapter  to  include  information  on  the 
record  placement  in  the  file.  Then  we  show  how  to  estimate  block  selectivities 
based  on  this  extended  model. 

We  define  the  linear  address  r  of  a  record  in  a  file  to  be  the  sequential  posi¬ 
tion  of  the  record  in  the  file.  W^e  define  the  Linear  address  space  of  a  file  to  be 
the  real  line  containing  the  linear  addresses  of  the  records  of  the  file.  The  linear 
address  space  is  divided  into  ordered  page  inle'rvaLs  that  correspond  to  the 
pages  of  the  file.  A  set  of  qualifying  records  corresponds  to  a  set  of  linear 
addresses  in  the  linear  address  space.  This  set  of  linear  addresses  determines  a 
set  of  page  intervals  that  contains  them.  The  number  of  blocks  in  the  file  that 
contain  the  set  of  qualifying  records  corresponds  to  the  size  of  this  set  of  dis¬ 
tinct  page  intervals  in  our  model.  Letpg(a:)  be  a  distribution  density  function 
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describing  the  likelihood  that  a  record  “with  linear  address  x  qualifies  to  the 
query  q.  It  is  necessary  that,  given  Pg(x)  and  the  number  of  qualifying  records, 
we  are  able  to  estimate  the  expected  number  of  page  intervals  containing  the 
linear  addresses  of  these  records.  In  the  following,  we  will  first  show  how  to  esti¬ 
mate  the  number  of  qualifying  records  in  a  query  g,  as  well  as  the  distribution 
density  function  p^{x).  Then,  given  pg{x),  we  will  show  how  to  estimate  the 
expected  number  of  blocks  containing  the  qualifying  records. 

The  linear  addresses  of  the  records  of  a  file  can  be  seen  as  the  values  of  an 
attribute  called  LINEAR-AD  DRESS.  We  view  the  records  of  a  file  as  points  in  an 
extended  (n+ 1  )-dimensional  atti'ibute  space,  where  n  is  the  num.ber  of  attri¬ 
butes  in  the  file,  and  the  (n  +  l)sL  attribute  is  the  attribute  LINEAR-ADDRESS. 
Each  attribute  corresponds  to  a  different  dimension  in  this  space.  The  dimen¬ 
sion  that  corresponds  to  the  attribute  LI  NEAR-AD  DRESS  is  the  linear  address 
dimension  (linear  address  space)  of  the  file.  A  record  of  the  file  is  mapped  into 
a  point  in  this  space. 

We  can  describe  the  distribution  of  the  points  in  the  extended  attribute 
space  with  a  multivariate  probability  density  function  7?(x).  It  should  be  men¬ 
tioned  here  that  to  estimate  block  seiectivities  without  relying  on  the  random 
placement  assumption,  it  is  essential  that  we  use  a  multivariate  probability  den¬ 
sity  function.  If  we  assume  independence  between  the  linear  address  and  the 
other  attributes,  the  result  will  be  equivalent  to  the  random  placemient  assump¬ 
tion  as  discussed  in  Chapter  5.  However,  no  other  assumptions  are  necessary. 
Thus,  the  results  of  this  section  are  independent  of  the  specific  family  of  mul¬ 
tivariate  distributions  used  to  describe  the  extended  attribute  space.  The 
parameters  of  the  multivariate  probability  density  function  p  (x)  can  be 
estimated  from  the  attribute  values  and  the  linear  addresses  of  the  records  by 
sequentially  scanning  the  file.  A  conjunctive  query  q  on  the  file  selects  a  number 
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71  of  qualifying  records  from  the  file.  The  expected  number  of  records  qualifying 
in  the  query  can  be  found  from  q,  p(x),  and  the  number  N  of  the  records  in  the 
file  as  it  wa.s  described  in  chapter  6.  The  distribution  density  function  Pq{x)  of 
the  records  qualifying  in  the  query  in  the  linear  address  space  of  the  file  can  be 
found  as  the  conditional  distribiition  on  the  linear  address  dimension  for  the 
attribute  values  specified  in  q. 

To  illustrate  these  ideas  consider  a  file  F  clustered  on  an  attribute  A.  The 
distribution  of  points  in  the  extended  attribute  space  which  is  formed  by  the 
attribute  A  and  the  address  space  of  the  file  is  shown  in  figure  7.1,1.  The  density 
distribution  of  the  records  that  qualify  in  the  query  A=‘'a‘',  in  the  address  space 
of  the  file,  is  non-zero  only  in  section  s  of  the  address  space. 


t 


Attribute  A 
values 


Address  space 

Figure  7.1.1 

Extended  attribute  space  of  the  clustering  attribute  A. 

Consider  now  another  attribute  B  of  the  file  with  attribute  values  correlated 
with  the  values  of  A.  The  density  distribution  in  the  extended  attribute  space 
formed  from  the  attribute  B  and  the  address  space  of  the  file  is  shown  in  figure 
7,1.2.  The  distribution  of  the  qualifying  records  in  a  query  B  =  "b"  in  the  address 
space  of  the  file  is  non-uniform  over  the  address  space  of  the  file. 
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Attribute  B 
values 


Figure  7.1.2 

Extended  attribute  space  of  the  attribute  B. 

We  show  next  hov/-  to  estimate  the  expected  number  of  blocks  containing 
the  set  of  qualifying  records  in  a  query,  given  the  distribution  density  function 
Pq{x)  of  the  qualifying  records  in  the  linear  address  space  of  the  file.  We 
analyze  this  problem  as  a  selection  problem:  A.  query  q  on  the  file  selects  a  set  of 
n  qualifying  records  from  the  file.  We  assume  that  the  likelihood  of  a  record 
with  linear  address  x  qualifying  for  the  query  q  is  equal  to  Pq(x)  and  estimate 
the  expected  number  of  distinct  page  intervals  containing  these  records.  Thus, 
our  model  takes  into  account  the  fact  that  the  likelihood  of  a  record  qualifying 
for  a  query  q  is  not  uniform  over  all  the  records  of  the  file. 

To  formalize  our  problem  as  a  probability  problem,  we  consider  the  follow¬ 
ing  related  experiment:  We  retrieve  with  replacement  ?i  records  from  the  file. 
The  likelihood  that  the  record  r^.  with  linear  address  x  is  retrieved  in  one  recoi'd 
retrieval  is  pq{x).  We  discard  the  selections  where  i.he  number  of  selected 
records  is  less  than  n.  An  outcome  of  our  experiment  is  a  set  D.i  of  n  distinct 
records  from  the  file.  The  universe  of  all  possible  outcomes  is  the  set  D  of  all 
the  sets  Di  of  n  distinct  records  from  the  file.  The  expected  number  of  distinct 


page  iriLervals  containing  an  outcome  of  our  experiment  is  given  by 


-  S 

D^eD 

where  p/j.  is  the  probability  of  the  outcome  Di  and  is  the  number  of  distinct 

page  intervals  containing  the  linear  address  spaces  of  the  records  in  D^.  We  pos¬ 
tulate  that  the  number  Bq  is  the  expected  number  of  blocks  in  the  file  contain¬ 
ing  the  set  of  records  qualifying  in  the  query  q.  We  need  to  estimate  from 

our  experiment.  The  probabilit)?-  that  a  set  s-,  (of  not  necessarily  distinct 
records)  is  selected  in  n  record  retrievals  with  replacement  is  given  by 

Psi  ~  Yl  Pg{^) 

where  Pq{x)  is  the  likelihood  that  the  record  Tx  with  linear  address  x  is  selected. 
This  is  a  consequence  of  the  fact  that  the  n  selections  are  done  with  replace¬ 
ment,  so  that  the  likelihood  of  a  record  being  selected  in  any  of  the  n  retrievals 
remains  constant.  The  probability  that  contains  n  distinct  records  is  given  by 

pW)  -  n  Pqi^) 

c  ,  r  n  r* 

'‘I’'" 

Therefore  the  probability  that  the  outcome  of  our  experiment  is  the  set  of  n  dis¬ 
tinct  records  Si  is  given  by 

Pd,  ^p{si\  Si  e  £>)  =  ^  TTV^y 

s-eD  r^ts^ 

Formulae  7.1.1  and  7.1.2  provide  the  expected  number  of  blocks  containing  the 
records  of  the  file  qualifying  in  the  query  q. 

For  the  case  that  Pq{x)  is  uniform  over  the  address  space  of  the  file,  we  can 
show  that  this  result  reduces  to  Yao’s  result  [Yao  1977].  By  grouping  ail  the 
selections  that  retrieve  at  least  one  record  from  a  given  block  i  vre  can  obtain 
the  probability  that  the  block  i  will  not  be  selected.  Thus: 
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For  uniform  PqCx)  we  obtain 


nr'N  —  h 


Pn^  = 


(i/MTcSi 


c. 


A  “  b 

71 


C. 


N 

n 


where  stands  for  the  combinations  of  N  objects  over  n.  The  expected  block 
selectivity  therefore  is: 


pN—b 


'n 


which  is  Yao’s  formula.  Thus  the  above  solution,  although  independently 
derived,  reduces  to  Yao’s  formula  for  uniform  record  distributions  over  the  file’s 
address  space.  However,  for  non-uniform  likelihood  distributions  this  solution 
requires  that  all  the  selections  of  n  distinct  records  from  the  file  be  exam¬ 
ined  and  and  Bj)^  be  computed.  This  is  an  expensive  computation  for  non¬ 
trivial  file  sizes. 

As  a  reasonable  approximation  to  the  solution,  we  will  assume  that  the  likel¬ 
ihood  of  a  record  being  selected  is  the  same  for  all  the  records  of  a  block.  For 
reasonably  smooth  likelihood  functions,  this  approximation  generally  does  not 
introduce  significant  errors.  Under  this  assumption,  all  selections  of  n  distinct 
records  from  the  file  that  result  in  the  same  set  of  ordered  pairs  where 

fc  is  a  page  interval  number  and  j)^  is  the  number  of  distinct  records  selected 
from  this  interval,  have  the  same  likelihood  of  being  selected.  There  are  fT  Cj^ 

k  Esb 

such  selections  of  distinct  records  that  result  in  the  same  set  where  b  is 

the  number  of  records  per  block  and  sb  is  the  set  of  blocks  specified  in  [{k.jf.)]. 
We  will  examine  all  the  selections  that  result  in  the  same  set  Wi  =  [(k.Jk)]  21s  one 
class  of  selections  s/^.  The  solution  becomes 


T.1.3 

Recently  B.  Bell  [Bell  1981]  has  shown  that  by  storing  temporary  resi-ilts  of  the  order  N*7l 
the  number  of  calculations  l  equired  for  estimating  7.1.1  using  7.1.2  is  of  the  order  N*7l.  Tliis 
is  still  expensive  for  large  files  and  large  record  selectivitiea. 
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p^H=  - -  7.1.4 

s  n 

w^cH  {lc,j^.)tWi 

where  ^  is  the  average  linear  address  space  of  the  fc’th  page  interval,  Pq^(F^)  is 
the  Jjfc'th  power  of  the  likelihood  in  the  linear  address  F*,  and  H  is  the  set  of  all 
possible  selections  from  the  file.  This  'soft'  approximation  reduces  the 
number  of  possible  sets  of  ri  reeords  that  have  to  be  examined  from  \D  \  to  \H  \  . 
However,  it  may  still  be  expensive  for  large  files.  The  size  of  H  is  greater  than 

minJn.M) 

\H\^  V 

i-\n/b\ 

where  b  is  the  number  of  records  per  block,  and  M  is  the  number  of  blocks  in 
the  file.  The  sum  is  the  number  of  different  choices  of  blocks  of  the  file  that  may 
contain  n  distinct  records.  However,  for  a  given  choice  of  blocks,  the  number  of 
reeords  in  a  specific  block  may  vary,  and  thus  more  than  one  selection  may 
appear  in  H.  For  example,  for  a  file  of  10000  records  with  a  blocking  factor  of  40 

250 

records  per  block  and  500  qualifying  records,  the  size  of  if  is:  ^ 

i=13 

which  is  a  large  number  (C 

In  the  following,  we  will  present  a  replacement  model.  We  will  assume  that 
the  likelihood  Pg (a; )  of  a  record  of  being  selected  does  not  change  with  the  selec¬ 
tion  of  other  records  from  the  file.  In  terms  of  the  experiment  described  before, 
this  assumption  corresponds  to  estimating  Bq  from  all  the  possible  outcomes  s^, 
not  only  those  containing  distinct  records.  To  provide  an  estimation  of  the 
number  of  blocks  containing  the  qualifying  records,  we  observe  that  we  can  con¬ 
centrate  our  attention  on  one  particular  block  at  a  time  since  pg(x)  does  not 
depend  on  the  retrieval  of  other  records  from  the  file.  Let  Pq(i)  be  the  probabil¬ 
ity  that  a  record  from  the  block  i  will  be  retrieved  in  a  record  retrieval.  The 
probability  that  a  block  i  will  not  be  retrieved  in  n  record  retrievals  is 
(l-Pg('i))”’.  The  expected  number  of  blocks  qualifying  in  a  query  is  given  by 
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^{1-{1-P,(i)r  7.1.5 

■1  =  1 

where  M  is  the  number  of  blocks  in  the  file. 


In  the  case  that  Pq{i)  is  uniform  over  all  the  address  space  of  the  file,  the 
expected  number  of  block  accesses  becomes  Bq  =  M (1  -  (1  -  where  M  is 

the  number  of  blocks  in  the  file.  This  is  the  formula  presented  by  Cardenas 
[Cardenas  1975].  Formula  7.1.5  is  inexpensive  to  compute.  However,  it  gives  an 


underestimation  of  the  expected  number  of  block  accesses  because  the  condi¬ 


tional  probability  that  a  record  will  be  selected  in  a  record  selection  from  the 


file,  given  that  certain  other  records  have  already  been  selected,  is  constant 


under  the  replacement  model,  while  it  increases  under  a  non-replacement 


model. 


Next,  we  derive  approximations  of  the  expected  number  of  block  accesses 
of  the  exact  (non-replacement)  model.  Consider  the  retrieval  of  n  records  from 


a  file  of  size  N  without  replacement.  At  the  first  record  retrieval,  the  probability 


that  a  block  will  be  retrieved  is  Pq{i).  However,  after  k  record  retrievals  from 
other  blocks  of  the  file,  the  probability  that  block  i  is  retrieved  increases.  The 
conditional  probability  (given  that  k  records  have  been  retrieved  from  the  rest 


blocks  of  the  file)  is - ^ - 

1- 

j=i 

selected  records.  Therefore  the  expected  number  of  blocks  containing  the  qual¬ 


ifying  records  is 


i/  71-1  P  (i) 

s,  =£'(E((l--Ps(i))ri{l - - ))) 

y  =  l 

M  n^\  pji) 

=  E{^((i-^q(^))11(i - — ))) 

where  E  denotes  expectation.  We  can  obtain  upper  and  lower  bounds  on  the 

* 

expected  number  of  block  accesses  by  finding  the  limits  of  Y^Pq{xj).  Then 

i=i 


k 

,  where  total  likelihood  of  the  k 

i=i 
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n(i) 


i=i  i;=i  l-fcmin(p,{i3)) 


))S  B, 


and 


B,  s  E((i--P,(»))Ti(i--i-T 

1  =  1  k  =  l  -L  ^ 


■)) 


max{pg{xj)) 

where  niin(pq(xj))  and  max{pq{xj))  is  the  minimum  and  the  maximum  values  of 
the  likelihood  density  function  in  the  address  space  of  the  file.  When  m.in(pq(xj)) 
is  zero,  the  lower  bound  of  the  expected  block  accesses  becomes  identical  to  the 
result  of  the  replacement  model.  For  uniform  distributions  in  the  address 
space,  min(pg  (x^))  =  max(pg(a:^))  =  1/iV.  In  this  case,  the  two  bounds  coincide 
with  the  result  of  the  exact  (non  replacement)  model  (see  Appendix  3). 

The  bounds  on  the  expected  block  accesses  can  be  improved  for  nearly 

symmetric  unimodal  distributions  by  considering  more  detailed  approximations 

k 

of  the  limits  of  Pq{xj).  This  can  be  done  for  the  upper  bound  case  by  consid- 
Bring  that  the  k  selected  records  were  the  nearest  records  to  the  mean  and 

m+k  /2 

integrating  Pq(x)  (in  this  case  the  sum  is  approximated  by  2  J  Pq{x)dx 

m 

instead  of  kpq{m),  where  m  is  the  mean  of  the  distribution).  The  above  observa¬ 
tions  about  the  bounds  of  the  expected  block  accesses  suggest  that,  for  unimo¬ 
dal  distributions,  a  reasonable  approximation  of  block  accesses  can  be  obtained 

k 

by  substituting  for  expected  value  from  the  likelihood  distribution. 

i  =  i 

This  approximation  is  exact  for  uniform  distributions.  It  is  also  accurate  as  k 

approaches  n.  For  small  values  of  k,  the  distribution  of  the  sample  of  k  records 

may  be  different  than  the  distribution  of  the  n  qualifying  records.  However,  for 
k 

small  k,  ^Pq(xj)  is  small,  and  thus  affects  only  slightly  the  probability  that 
j=i 

Pq  (i)  ,  . 

- - - ^  Pq{i).  Since  for  large  k  the  assumption  is 

1-  '^pqixj) 


3=1 


block  i  is  selected: 
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realistic,  we  should  expect  that  this  approximation  will  give  good  results.  The 
result  becomes: 


^  P„(i) 

B,  =  E((i-^,(o)n(i - — 


)) 


M 


n-1 


E((i-^5(o)n(i- 


j=i 

Pad) 


) 


i  =  l  A:  =  l  1  ^  )  ) 

since  the  first  derivative  at  the  mean  for  nearly  symmetric  distributions  is  zero. 


Or 


M  n-1  p  (i) 

B,  =  E((l-^’,(i))n(l - ^  , 

where  a  is  the  standard  deviation  of  the  likelihood  distribution.  A  computation¬ 
ally  simpler  formula  for  this  result  is  presented  in  Appendix  4.  The  result 
requires  only  a  summation  over  the  number  of  blocks  in  the  file,  and  thus  it  is 
inexpensive  to  compute.  In  certain  cases  it  may  be  advantageous  to  consider 
sub-sections  of  the  address  space  where  the  probability  density  is  approxi¬ 
mately  constant,  and  estimate  the  block  selectmty  as  summation  over  these 
intervals.  This  will  reduce  the  number  of  terms  in  the  summation.  This 
approach  is  also  more  appropriate  when  non-unimodal  probability  functions  are 
used  for  approximating  the  density  in  the  extended  space.  A  special  case  is 
clustered  files  which  we  will  consider  later. 


Figures  7.1.3  and  7.1.4  show  the  dependence  of  the  expected  block  selec¬ 
tivity  on  the  standard  deviation  and  the  shape  of  the  likelihood  probability  den¬ 
sity  function  pq{x)  for  Pearson  Type  2  symmetric  unimodal  distributions  [Elder- 
ton  1953].  The  parameter  k  indicates  the  shape  of  the  distribution  from^  uniform 
to  Normal  as  k  goes  from  zero  to  infinity,  while  the  parameter  lu  indicates  the 

range  of  the  distribution  - 5-,  where  cP  is  the  standard  deviation). 

(2k+3)(P 

The  lower  and  the  upper  bounds  of  the  expected  block  accesses  for  all  the 
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placements  are  also  shown,  as  well  as  the  the  results  of  the  random  placement 
models.  It  can  be  seen  from  these  figures  that  the  discrepancies  between  the 
results  of  the  replacement  and  the  non-replacement  models  are  small  in  com¬ 
parison  to  the  errors  that  can  be  introduced  by  the  random  placement  assump¬ 
tion,  when  this  assumption  is  not  satisfied. 

Figures  7.1.5  and  7.1.6  compare  the  results  of  the  exact  model  with  the 
results  of  our  approximate  model  as  well  as  with  the  results  of  the  previous 
models.  The  comparison  is  done  for  various  values  of  the  correlation  coefficient 
between  the  attribute  values  and  the  address  space  values.  As  can  be  seen 
from  the  diagrams,  the  relative  error  of  the  random  selection  assumption 
models  becomes  significant  for  correlations  larger  than  .6.  Thus  the  use  of  a 
more  detailed  model  will  be  useful  if  the  correlation  in  the  extended  space  is 
expected  to  exceed  this  value. 

In  figure  7.1.6,  it  is  shown  that  the  relative  error  of  Cardenas’  formula  has  a 
minimum.  This  minimum  can  be  explained  by  the  fact  that  Cardenas’  formula 
assumes  random  placement  of  the  qualifying  records,  as  well  as  replacement  of 
the  retrieved  records.  When  the  random  placement  assumption  is  true  (for 
7'  =  0),  this  formula  is  an  underestimation  of  block  accesses  because  of  the 
replacement  assumption,  and  therefore,  the  relative  error  is  non-zero.  As  the 
correlation  increases,  the  random  placement  assumption  becomes  unrealistic 
because  the  number  of  blocks  of  the  file  where  the  probability  of  retrieving 
records  is  non-zero  decreases.  Thus  the  actual  expected  number  of  blocks  to  be 
retrieved  decreases  as  the  correlation  increases.  The  minimum  relative  error  in 
Cardenas’  formula  (which  is  zero)  is  achieved  w^hen  the  two  estimations  are 
equal.  On  the  other  hand,  Yao’s  formula  is  accurate  for  r  =  0,  hut  it  is  an  overes¬ 
timation  when  the  random  selection  assumption  is  not  true  (?’!  =  0).  Thus  its 
relative  error  starts  from  zero  and  increases  as  the  correlation  increases. 
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Figiix  e  7.1.3 

Block  selectivity  as  function  of  the  shape  of  ’Pqix).  The  kurtosis  of  -po^x)  in¬ 
creases  from  A:=0  (uniform)  to  fc  =  10  (unimodal). 


Figure  7.1.4 

Block  selectivity  as  function  of  the  range  of  pq(x).  High  values  of  W  indicate 
small  potion  of  the  address  space  where  the  probability  is  non-zero  (uniform). 
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Yao 


Figure  7.1.5 

Block  selectivities  in  uniform  distributions 


Relative 

error 


Figure  7.1.6 

Relative  error  of  block  selectivities  in  uniform  distributions 
Since  the  exact  model  is  very  expensive  for  non-uniform  distributions,  a 
simulation  was  used  for  examining  the  approximations  provided  from  our 
approximate  model  for  Normal  distributions.  Figures  7.1,7  and  7.1.8  show  that 
the  results  of  the  simulation  were  remarkably  close  to  the  results  of  the  approx- 
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imate  model. 


Figure  7. 1.7 

Block  selectivities  for  Normal  distributions. 


Relative 

error 


Figure  7.1.8 

Relative  error  in  block  selectivities  for  Normal  distributions. 


-  103  - 


7.2  Estimation  of  Block  Selectivities  for  Alternative  Clustering 

Although  the  previous  model  is  useful  for  the  estimation  of  the  number  of 
blocks  that  contain  the  records  qualifying  in  a  query,  in  a  certain  case  it  may  be 
expensive  to  estimate  the  parameters  G[pq{z).  This  is  the  case  when  the  perfor¬ 
mance  of  the  file  under  alternative  clusterings  is  considered.  In  this  case,  the 
above  approach  requires  that  the  file  be  sorted  according  to  the  values  of  the 
clustering  attribute,  and  that  the  parameters  of  Pq{x)  be  estimated  for  each 
ordering.  This  is  an  expensive  operation.  We  note  here  that,  under  the  random 
placement  m.odel,  the  expected  number  of  blocks  containing  the  records  quali¬ 
fying  in  a  query  is  the  same  under  all  alternative  clusterings  if  the  primary  attri¬ 
bute  is  not  referred  to  in  the  query.  In  the  following,  we  will  show  an  alternative 
method  of  estimating  the  expected  number  of  block  accesses  for  clustered  files. 

Let  A  be  the  clustering  attribute  of  the  file.  If  the  query  contains  a  conjunc¬ 
tive  term  A='a’  for  the  attribute  A  (in  addition  to  the  values  of  the  other  attri¬ 
butes),  then  we  will  assume  that  the  records  qualifying  in  the  query  are  ran¬ 
domly  selected  from  the  blocks  of  the  file  that  have  records  with  attribute  A 
values  equal  to  ’a’.  Thus  the  expected  number  of  blocks  that  contain  records 
qualifying  in  the  query  is  given  by 


where  n  is  the  number  of  records  qualifying  in  the  query  and  k  is  the  number  of 
blocks  of  the  file  that  have  attribute  k  value  equal  to  ’a’,  ks  before,  n  can  be 
estimated  from  the  probability  distribution  p(x)  and  the  number  N  of  the 
records  in  the  file.  Thus  n  depends  on  the  correlation  betvv’-een  the  clustering 
attribute  and  the  other  attributes  in  the  query.  The  parameter  k  can  be 
estimated  as  E  {A '=■' a' )/b,  where  E(A='a.')  is  the  expected  number  of  records 
qualifying  in  the  query  A=’a’  and  b  is  the  number  of  records  per  block. 


In  the  case  that  a  query  on  the  file  does  not  specify  a  value  for  the  attribute 
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A,  a  summation  over  all  the  values  of  the  attribute  A  is  required.  The  result  is 

B,  =  7.2.1 

a 

where  is  the  number  of  records  qualifying  in  a  query  containing  a  conjunction 
of  the  original  query  with  the  query  A=’a',  for  a  value  ’a’  of  the  attribute  A;  is 
the  number  of  blocks  containing  the  records  qualifying  in  the  query  A=’a'.  In 
the  case  that  the  clustering  attribute  A  has  a  large  number  of  distinct  values, 
the  summation  can  be  done  in  ranges  of  values  of  A,  instead  of  single  values,  to 
make  the  calculations  less  expensive. 

It  was  shown  in  the  previous  sections  that  the  expected  block  selectivity  is 
smaller  for  highly  peaked  distributions  in  the  address  space  of  the  file.  There¬ 
fore  the  block  selectivity  of  attributes  highly  correlated  with  the  clustering 
attribute  will  be  small  since  for  high  correlations  the  conditional  distributions 
will  have  small  variance.  This  suggests  that  correlations  and  other  dependencies 
between  the  clustering  attribute  and  the  other  attributes  are  important  for 
evaluating  the  performance  of  clustered  files. 

To  show  the  effect  of  correlations  on  the  performance  of  clustered  files, 
consider  an  example  from  the  engihfeering  report.  We  assume  that  the  engineer¬ 
ing  data  base  is  clustered  on  the  attribute  YEAR-OF-GRADUATION.  We  consider 
queries  that  ask  for  the  engineers  in  various  responsibility  levels.  The  summary 
statistics  in  the  report  can  be  used  to  find  the  maximum  number  of  blocks  of 
the  file  that  may  contain  engineers  of  a  specific  responsibility  level  for  a  given 
blocking  factor.  (See  also  table  4.3.4.)  The  results  are  summarized  in  Table 
7.2.1,  for  a  blocking  factor  of  40  l  ecords  pei’  block. 

In  the  table  RESL  is  the  responsibility  level  and  MAXB  is  the  maximum 
number  of  blocks  containing  records  in  the  corresponding  responsibility  level. 
However,  due  to  the  high  concentration  of  qualifying  records  in  small  number  of 
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Table  7.2.1 

RESL 

MAXB 

APPR 

RAPL 

A 

131 

189 

353 

R 

258 

275 

358 

C 

341 

323 

362 

D 

336 

346 

362 

E 

305 

297 

ooo 

ouc 

F 

231 

180 

341 

blocks,  the  exact  number  of  blocks  will  be  very  near  to  the  maximum.  RAPL  is 
the  estimated  number  of  blocks  containing  qualif5dng  records  using  the  random 
placement  assumption  (and  the  exact  number  of  qualifying  records  in  each 
query:  using  uniformity  and  random  placement  the  expected  number  of  qualify¬ 
ing  blocks  is  alwa3/-s  ail  the  362  blocks  of  the  file).  Finall)'’  APPR  is  the  expected 
number  of  blocks  using  the  methods  of  this  section  to  approximate  block  selec- 
tivities.  The  estimation  of  record  selectivities  was  done  using  the  family  of 
parametric  distributions  described  before.  As  can  be  seen  from  this  table,  the 
block  selectivities  estimated  by  the  random  placement  assumption  model  are 
pessimistic  and  they  result  in  a  sequential  scan  strategy  for  all  the  queries 
posed.  However,  at  least  half  of  these  queries  (responsibility  levels  A,  B,  and  F) 
would  be  evaluated  considerably  more  inexpensively  in  terms  of  block  accesses 
by  using  the  index  on  the  attribute  RESPONSIBILITY-LEVEL. 

This  example  shows  that  the  query  evaluation  strategy  in  clustered  files 
depends  on  the  correlation  of  the  clustering  attribute  and  the  other  attributes 
(a  sequential  scan  would.be  profitable  for  all  queries  on  responsibility  level  if  the 
two  attributes  were  independent).  Moreover,  this  example  shows  that  the  choice 
of  an  attribute  for  clustering  should  not  only  be  based  on  the  number  of  queries 
referring  to  the  attribute  itself,  but  the  impact  of  clustering  on  the  other  attri¬ 
butes  should  also  be  considered.  Finally,  this  example  shows  that  in  clustered 
files  the  profitability  (and  therefore  the  selection)  of  indices  and  links  depends 
on  the  correlations  between  the  clustering  attribute  and  the  other  attributes. 
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7.3  Esiimatiou  of  Average  Block  Selectivities 

To  estimate  the  average  block  selectivities,  several  factors  are  important: 
the  distribution  of  attribute  values  in  the  file,  the  data  placement  among  the 
blocks  of  the  file,  and  the  distribvjtion  of  attribute  values  in  the  user  queries. 
The  average  block  selectivities  for  conjunctive  equality  queries  in  a  subspace  m 
can  be  approximated  by 

j  i  =  l 

where  is  the  probability  that  a  conjunction  j  appears  in  a  query  in  this  sub¬ 
space,  Pi\j  is  the  probability  that  a  record  is  retrieved  from  the  block  i  in  one 
record  retrieval  from  the  file,  and  is  the  number  of  records  qualifying  in  l.he 
query.  This  equation  assumes  replacement.  The  double  summation  in  the 
expression  may  be  expensive  for  large  files.  Howev'er,  the  cost  of  the  summation 
can  be  reduced  by  considering  Pi\^j  to  be  a  histogram  of  a  small  number  of  inter¬ 
vals.  In  this  case,  the  summation  will  involve  subsections  of  the  address  space  of 
the  file  where  the  probability  for  a  record  to  qualify  in  a  query  will  be  constant. 
These  subsections  can  be  found  by  the  shape  of  the  conditional  distribution  Pi\j- 
The  result  is  expressed  by: 

-  (1  - 

3  i  i 

where  riij  is  the  number  of  records  that  is  placed  in  subsection  i  of  the  address 
space.  A  special  case  of  this  formula  is  the  estimation  of  average  block  selectivi¬ 
ties  in  clustered  files.  The  result  v.'ill  be 

E?,fc(i-(i-  7)"') 

j  ^ 

if  the  clustering  attribute  A  appears  in  the  conjunction  (Ic  is  the  number  of 
blocks  in  the  file  with  the  attribute  value  of  A  that  appears  in  the  conjunction), 
or 

-  (1  - 

}  a 
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if  the  clustering  attribute  does  not  appear  in  the  query  (a  is  a  value  of  the  clus¬ 
tering  attribute,  and  n^j  is  the  number  of  records  in  the  file  qualifying  in  a  query 
containing  a  conjunction  on  the  original  query  with  the  query  =  'a'). 

For  files  where  the  random  selection  assumption  can  be  justified,  other 
computational  and  storage  saving  simplifications  may  be  used.  The  attribute 
values  appearing  in  conjunctions  may  be  divided  into  three  classes:  small 
number  of  records  qualifying,  intermediate  number  of  records  qualifying,  and 
large  number  of  records  qualifying.  For  attribute  values  in  the  "small"  class, 
every  qualifying  record  belongs  to  a  separate  block.  Therefore  the  cost  is 
^Qjs'^js-  values  in  the  "large"  class  the  cost  is  For  values 

in  the  "intermediate"  class  the  cost  is  Y^qjiM{l  -  (1  -  If  queries  are  uni- 

j 

form,  detailed  statistics  for  only  the  records  of  the  interm.ediate  class  need  to 
be  kept.  The  approximate  boundaries  of  these  classes  can  be  found  from  the 
graphs  presented  in  the  previous  section. 


7.4  Summary 

In  this  chapter  we  examined  the  problem  of  estimating  block  selectivities. 
Previous  methods  for  estimating  block  selectivities  are  based  on  a  model  of  ran¬ 
dom  placement  of  qualifying  records  among  the  blocks  of  a  file.  We  presented  a 
more  general  model  for  estimating  block  selectivities  which  is  able  to  take  into 
account  non-random  placement  of  the  qualifying  records  among  the  blocks  of 
the  file,  and  we  showed  how  to  estimate  block  selectivities  based  on  this  model. 
Comparison  of  this  model  with  the  pre\n.ou3  models  showed  that  when  the  ran¬ 
dom  placement  assumption  is  not  satisfied,  large  errors  may  result. 

Clustering  is  a  very  important  method  for  improving  data  base  perfor¬ 
mance.  However,  a  variety  of  data  base  design  problems  (like  query  processing. 


-  108  - 


index  selection,  and  link  selection)  have  been  studied  for  non-clustered  files  only 
[Schkolnick  1978].  Moreover,  the  selection  of  a  clustering  attribute  has  only 
been  based  on  the  assumption  that  queries  on  the  other  attributes  of  the  file  are 
not  affected  by  the  choice  of  the  clustering  attribute.  We  showed  that  this  is  not 
true.  Our  model  is  appropriate  for  generalizing  the  existing  solutions  to 
clustered  file  environments. 

W"e  also  presented  a  less  expensive  method  for-  the  estimation  of  block 
selectivities  under  alternative  clusterings.  To  our  knowledge,  there  are  no  exist¬ 
ing  models  that  are  able  to  compare  the  effect  of  alternative  clusterings  on  data 
base  performance.  In  our  model,  the  expected  number  of  block  accesses 
depends  on  the  correlations  of  the  clustering  attribute  with  the  other  attributes 
in  the  file. 

Finally,  methods  for  estimating  average  block  selectivities  were  discussed. 
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Chapter  8:  Applications  of  Improved  Approximations 


In  this  chapter  we  present  applications  of  the  improved  approximations  of 
record  and  block  selectivities  in  query  evaluation,  index  selection,  and  data  base 
design  performance  prediction. 

8.1  Query  Evaluation 

In  this  section,  we  discuss  the  application  of  the  improved  approximations 
in  the  query  evaluation  problem.  Our  objective  is  to  illustrate  where  the 
improved  approximations  contribute  to  selecting  an  improved  query  evaluation 
strategy.  We  first  examine  query  processing  in  flat  file  environments  and 
present  algorithms  that  take  advantage  of  the  improved  approximations  for 
selecting  a  good  query  evaluation  strategy.  Wm  present  examples  and  experi¬ 
ments  to  show  the  savings  that  may  occur,  and  we  indicate  where  this  approach 
is  more  useful  than  one  that  uses  a  simpler  model  of  data  base  contents  and 
data  placement  on  devices.  The  class  of  queries  examined  in  this  section  is 
wider  than  in  previous  research.  lYe  then  give  examples  of  selecting  a  good 
query  evaluation  strategy  in  a  distributed  environment  where  a  copy  of  the  same 
file  may  exist  in  two  sites;.  Finally,  we  present  examples  of  the  use  of  the 
improved  approximations  in  query  processing  in  multi-file  environments. 
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Flat  File  Environments 

In  this  section  we  present  query  processing  in  a  single  inverted  file  environ¬ 
ment.  The  classes  of  queries  that  we  consider  are  single  attribute  conditions, 
conjunctive  conditions,  disjunctive  conditions,  conjunctive  normal  forms,  dis¬ 
junctive  normal  forms,  and  simple  num±cric  conditions. 


Single  attribute  conditions 

Consider  an  attribute  condition  on  a  single  attribute  A.  If  there  is  no  index 
on  A,  then  a  sequential  scan  of  the  file  is  required  to  evaluate  the  query.  Let  CS 
be  the  cost  of  the  sequential  scan  of  the  file.  If  there  is  an  index  associated  with 
A,  then  it  may  or  may  not  be  profitable  to  use  the  index  to  evaluate  the  query. 
Let  CI^  be  the  cost  of  accessing  the  index  and  CR^-  be  the  cost  of  retrieving  the 
blocks  of  the  file  that  contain  qualifying  records.  An  algorithm  for  processing 
single  attribute  queries  is:  if  C/®  +  CR^  <  CS  use  the  index  on  the  attribute  to 
retrieve  the  qualifying  records.  Otherwise,  use  a  sequential  scan  of  the  file. 


Let be  the  I’ecord  selectivity  of  the  query  Q,  IBF  be  the  blocking  factor 
of  the  index  blocks,  IL  be  the  number  of  levels  of  the  index,  3F  be  the  blocking 
factor  of  the  file  blocks,  and  BS^-  be  the  block  select  ivity  of  the  query  Q.  The 

RS^  - 


cost  CI^  can  be  estimated  as 


+  IL  and  the  cost  CR^  as  if  the  attribute 


A  is  a  clustering  attribute,  or  BS^  if  the  attribute  A  is  not  a  clustering  attribute. 
The  above  formula  assumes  that  the  RlD’s  are  sorted  already  when  retrieved 
from  the  index,  or  that  whenever  a  block  is  retrieved,  because  one  record  of  the 
block  contains  one  qualifying  record,  ail  the  records  of  the  block  are  examined 
to  decide  if  they^  qualify'  in  the  query.  When  this  assumption  is  not  satisfied  (for 
example  in  a  non-equality'  query'  or  when  the  output  should  be  presented  accord¬ 
ing  to  the  order  of  the  records  in  the  file)  the  cost  of  sorting  the  itlD's  should  be 

RS^  -RS^ 

added  to  the  index  cost  Iog^(  jjjjf)  if  a-  z  way'  external  merge  sort  is  used. 
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■where  z  is  the  number  of  available  buffers  for  the  sort). 

Consider  a  file  of  100000  records.  The  hie  blocks  contain  50  records  per 
block,  while  the  leaf  index  blocks  contain  200  (attribute  value, RID)  pairs  per 

block.  The  hie  occupies  =  2000  blocks  of  secondary  storage.  Thus  the 

cost  of  sequentially  scanning  the  hie  is  2000  block  accesses.  Let  A  be  an  attri¬ 
bute  of  the  file  having  a  domain  of  10  distinct  attribute  values  (1,2,3 . 10)  each 

one  involving  the  same  number  of  records  of  the  file.  The  number  of  leaf  pages 
containing  (attribute  value. RID)  pairs  to  the  blocks  of  the  file  with  a  certain 

attribute  A  value  is  200  “  blocks.  Assuming  that  the  height  of  the  index 

is  3  levels,  the  cost  of  retrieving  the  index  for  a  specific  value  of  A  is  53  blocks. 
To  evaluate  the  attribute  condition  (A  ^  2),  the  sequential  scan  is  always  the 
best  strategy.  The  reason  is  that,  to  evaluate  this  inequality  using  the  index  it 
is  required  to  retrieve  3  +  8  *  50  =  403  blocks  of  index,  and  at  least 
80000 


50 


=  1600  blocks  of  records.  (1600  blocks  of  records  will  be  retrieved  if  the 


fil-^  is  clustered  on  attribute  A.)  Therefore  the  total  cost  will  always  be  more  than 
the  2000  block  accesses  that  the  sequential  scan  requires.  However,  for  queries 
involving  between  1  and  8  distinct  attribute  values  of  A,  the  optim-al  query 
evaluation  strategy  depends  on  the  placement  of  the  qualifjnng  records  among 
the  blocks  of  the  file. 

Realistic  examples  were  given  in  Chapter  7,  where  the  correlation  of  attri¬ 
bute  values  with  the  clustering  attribute  in  the  engineering  data  base  caused 
qualifying  records  to  cluster  in  a  small  portion  of  the  blocks  of  the  file.  In  this 
case,  the  use  of  indices  -was  sho-vvn  to  be  profitable  for  some  of  the  queries  on 
responsibility  level,  while  the  sequential  scan  Vv'as  the  indicated  strategy  with  a 
model  using  random  placement.  Figures  7.1.3,  7.1.4.  7.1.5.  and  7.1.6  present 
estimates  of  the  block  selectmty  and  the  error  that  mav  occur  with  the  random 
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placement  assumption  when  it  is  not  satisfied. 


Cost 
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Figure  8.1.1 

Selection  of  an  optimal  strategy  for  non-uniform  placement. 


Figure  8.1.1  shows  the  cost  of  queries  having  various  record  selectivities  for 
two  different  placements  of  the  qualifying  records  in  the  secondary  strorage: 
one  with  uniform  density  distribution,  and  one  with  unimodal  density  distribu¬ 
tion.  The  cost  of  the  sequential  scan  is  also  indicated.  It  can  be  seen  from  the 
figure  that  sequential  scan  is  less  frequently  the  indicated  strategy  for  the  uni¬ 
modal  placement  of  qualifying  records  than  for  the  uniform  one.  Thus  for  single 
attribute  conditions,  accurate  modelling  of  the  data  base  contents  as  well  as  the 
data  placement  on  devices  are  essential  for  the  selection  of  an  optimal  strategy. 

Conjunctive  conditions 

In  order  to  be  able  to  use  indices  instead  of  a  sequential  scan  to  evaluate  a 
conjunctive  condition,  some  of  the  attributes  participating  in  the  conjunction 
must  have  an  index  associated  with  them.  If  more  than  one  of  the  participating 
attributes  are  indexed,  we  would  like  to  select  a  set,  S,  of  indexed  attributes  for 
use  in  processing  the  conjunction.  The  indices  corresponding  to  those 
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at, tributes  will  be  retrieved  from  the  secondary  memory  and  then  intersected  to 

find  the  set  of  RID’s  that  satisfy  the  condition.  The  records  corresponding  to 

this  set  of  RID’s  will  be  retrieved  and  then  examined  to  find  the  records  that 

satisfy  the  other  attibute  conditions.  The  set  S  should  minimize  the  following 

quantity;  "*■  where  Clf  is  the  cost  of  retrie\’ing  the  portion  of  the 

j  eS 

index  corresponding  to  the  attribute  condition  on  the  attribute  j  in  the  query  Q, 
5"  is  a  set  of  indices  to  be  used  for  processing  the  query,  and  CR§  is  the  cost  of 
retrieving  the  records  qualifying  in  the  conjunction  of  the  attribute  conditions  in 
Q  that  correspond  to  the  set  S  (block  selectivity  for  these  records).  (Alterna¬ 
tively  the  cost  could  include  sorting  of  indices.)  An  algorithm  for  query  evalua¬ 
tion  of  conjunctive  queries  with  minimal  block  transfers  is: 

1.  Set  Smin  =  0  and  Cmin  =  CS,  where  is  the  set  of  indices  that  give  the 
minimum  cost  so  far  and  is  this  cost. 

2.  For  i  =  1,2 . I  perform  steps  3  to  5. 

3.  Find  the  set  of  indices  5'  with  IS"']  =  i  which  minimizes  the  cost 

jeS' 

For  '1  =  1,  keep  the  cost  Clj^  for  all  j  and  order  them  such  that  Cl9  ^  CI^,  for 
j  ^  k.  Let  ejnin  be  the  minimum  cost  and  the  set  of  the  corresponding 
indices. 

Cmin  ^min  C'niin  “  ^min»  min  ~  min 

5.  If  Cinm  ^  T,  Clf  go  to  6. 

j  =  l,i+l 

6.  If  then  use  the  set  of  indices  to  evaluate  the  conjunction. 

Else  use  a  sequential  scan. 

The  termination  criterion  in  step  5  is  justified  by  the  fact  that  the  cost  of  an 
optimal  strategy  that  uses  at  least  i  +  1  indices  is  greater  or  equal  to  the  cost  of 
retrieving  the  i  +  1  shorter  indices  from  the  secondary"  storage.  If  the  total  cost 
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of  an  optimal  strategy  that  uses  at  most  i  indices  is  less  than  the  cost  of  retriev¬ 
ing  the  shortest  i  +  1  indices,  then  this  is  an  optimum  stategy.  Theoretically 
this  algorithm  requires  a  combinatorial  number  of  iterations  in  order  to  find  the 
optimal  strategy.  However,  the  number  of  qualifying  records  typically  decreases 
rapidly  with  the  number  of  indices  used.  Thus  only  a  few  indices  will  in  most 
cases  be  enough  to  satisfy  the  termination  cr  ilei  ion. 

Consider  the  same  file  of  100000  records  and  the  attributes  A  and  B  with  10 
distinct  attribute  values  from  1  to  10.  The  conjunctive  condition  (A  ^  ^  7) 

4  4 

involves  -  *  -  *  100000  =  16000  records  if  we  assume  that  A  and  B  follow  the 

10  10 

uniformity  and  independence  assumptions.  The  number  of  blocks  of  the  index 
required  for  the  evaluation  of  the  query  is  *  (3  +  50  *  4)  =  406  blocks.  Under 
the  random  selection  assumption,  a  sequential  scan  of  the  file  is  always  the 
recommended  strategy.  In  our  model,  the  recommended  stategy  could  be 
sequential  scan,  index  on  A,  index  on  B,  or  on  both  indices.  The  reason  is  that  in 
conjunctive  conditions  both  the  correlation  of  attribute  values  and  the  non- 
random  placement  of  the  records  with  respect  to  one  or  more  of  the  participat¬ 
ing  attributes  are  important  for  the  selection  of  the  optimal  strategy.  If  the 
records  of  the  file  that  qualify  in  the  condition  (4^7)  for  example  are  clustered 
in  a  small  portion  of  the  address  space  of  the  file,  a  profitable  strategy  would  be 
to  use  the  index  on  A  to  retrieve  the  reeords  qualifying  in  this  condition  and 
then  examine  the  condition  {B  on  these  records.  The  cost  of  retrieving  the 
qualifying  records  in  the  condition  (A  ^7)  depends  on  how  well  these  records  are 
clustered  among  the  blocks  of  the  file  as  was  discussed  for  the  single  attribute 
queries.  The  record  selectivity,  and  thus  the  optimal  strategy,  is  sensitive  to 
correlations  and  other  dependencies  among  the  attribute  values  of  the  attri¬ 
butes  participating  in  the  query. 
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Figure  B.1.2  shows  the  cost  of  evaluating  two  conjunctive  queries  on  the 
attributes  A  and  B  as  a  function  of  the  correlations  of  the  two  attributes.  It  is 
assumed  that  the  qualifying  records  are  randomly  placed  in  the  blocks  of  the  file 
and  that  the  density  distribution  in  the  subspace  of  the  attributes  A  and  B  fol¬ 
lows  a  multivariate  unirnodal  distribution.  The  figure  shows  the  cost  of  a  stra¬ 
tegy  that  uses  both  indices  to  retrieve  the  qualifying  records.  The  cost  of  the 
sequential  scan  is  also  shown  for  comparison.  As  the  figure  shows,  the  cost  of  an 
optimal  strategy  for  a  conjunctive  condition  on  the  two  attributes  depends  on 
the  attribute  values  of  the  two  attributes  that  appear  in  the  query,  as  well  as  on 
the  correlation  of  the  attribute  values  of  the  records  of  the  file. 


Cost 


BF=50 
BFI=200 
XL=  3 

N=100000 

(without 

sorting 

costs) 


0  Correlation  r 


query 


1  using  indices 
sequential  scan 


query  2  using  indices 


T 


Figure  B.1.2 

Cost  as  a  function  of  correlations 

It  should  be  mentioned  here  that  in  addition  to  the  choice  between  a 
sequential  scan  or  the  use  of  indices  for  evaluating  a  conjunctive  query  the 
choice  of  an  optimum  indexing  set  to  process  the  query  depends  also  on  the 
correlation  of  the  attribute  values  of  the  attributes  in  the  query.  Consider  a 
conjunctive  query  on  three  attributes  A,  B.  and  C  with  100,  100,  and  50  distinct 
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values  respectively.  If  the  correlations  of  the  attribute  values  of  the  three  attri¬ 
butes  are  not  taken  into  account  for  selecting  the  optimal  strategy,  the  indices 
of  the  first  attributes  ■will  be  used  to  evaluate  a  conjunctive  equality  condition  on 
the  three  attributes  because  only  a  small  number  of  records  in  the  file  is 
expected  to  qualify  in  the  query  (100000/(100*100)=  10  records).  However,  as 
many  as  10000  records  may  qualify  if  these  two  attributes  are  highly  correlated. 
In  this  case  the  use  of  all  three  indices  for  evaluating  the  query  would  further 
reduce  the  cost.  Moreover,  if  C  has  attribute  values  correlated  with  the  values 
of  one  of  the  other  attributes  (say  with  A),  then  the  choice  of  the  indexing  set 
|A,C]  could  be  optimal. 

In  summary,  for  conjunctive  queries,  attribute  value  correlations  are 
important  for  selecting  an  optimal  set  of  indices  for  piocessing  the  query.  In 
this  section  we  assumed  that  the  single  attribute  conditions  involved  less  than 
50%  of  the  records  of  the  file.  If  this  is  not  the  case  for  one  attribute,  further 
reductions  in  the  index  cost  are  possible  by  retrieving  the  part  of  the  index 
corresponding  to  the  remaining  values  of  the  attribute  and  evaluating  the  condi¬ 
tion  AHD-NOT  instead  of  AND.  Once  more,  accurate  estimations  of  the  record 
selectivities  for  each  attribute  will  result  in  a  strategy  that  is  typically  better 
than  the  one  derived  from  a  model  using  uniformity  of  attribute  values. 

Disjunctive  Conditions 

The  choice  of  an  optimal  strategy  for  evaluating  disjunctive  conditions  will 
always  be  between  a  sequential  scan  of  the  file  and  the  use  of  all  the  indices  of 
the  attributes  in  the  disjunctive  condition  (in  contrast  to  the  conjunctive  condi¬ 
tions  where  the  use  of  only  a  subset  of  the  indices  could  also  be  profitable).  To 
avoid  a  sequential  scan  in  a  disjunctive  condition,  all  the  attributes  participating 
in  the  disjunction  should  be  indexed.  In  this  case,  indices  should  be  used  for  the 
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evaluation  of  the  query  only  if  +  CR^  <  CS  where  CI^  is  the  cost  of  retrieving 
the  portion  of  the  indices  necessary  for  evaluating  the  disjunction,  CR^  is  the 
number  of  blocks  of  the  secondary  storage  containing  the  qualifying  records, 
and  CS  is  the  cost  of  a  sequential  scan  of  the  file. 

Data  correlations  and  data  placement  on  de\'ices  affect  the  optimization 
strategy  for  disjunctive  conditions,  but  to  a  lesser  degree  than  in  conjunctive 
conditions.  The  reason  is  that  the  number  of  records  that  qualify  in  a  conjunc¬ 
tion  is  small  with  respect  to  the  sum  of  the  number  of  records  that  qualify  in 
each  attribute  condition  (thus  the  effect  of  data  correlations  is  smaller).  Also, 
since  all  the  indices  have  to  be  used  in  the  evaluation  of  the  disjunction,  the  fact 
that  one  of  the  attribute  conditions  specified  has  qualifying  records  clustered  in 
a  small  number  of  the  blocks  of  the  file  affects  to  a  lesser  degree  the  total  cost. 
In  disjunctive  conditions,  the  distribution  of  attribute  values  of  each  attribute 
seem  to  be  an  important  factor  for  the  choice  of  an  optimal  strategy.  Thus  the 
assumption  of  uniformity  of  the  attribute  values  of  each  attribute  may  be  the 
most  significant  source  of  errors  in  estimating  the  cost  of  processing  disjunctive 
conditions. 

Conjunctive  normal  form  queries 

A  conjunctive  normal  form  query  was  defined  as  a  conjunction  of  disjunctive 
conditions  We  will  call  a  single  disjunctive  condition  indexed,  if  all  the  attri¬ 
butes  participating  in  Ai  have  an  index  associated  with  them.  Only  then  can 
indices  be  used  for  resolving  Ai.  The  algorithm  to  process  queries  in  conjunctive 
normal  form  is  similar  to  the  algorithm  that  processes  conjunctive  conditions. 
The  indices  of  only  the  indexed  attribute  conditions  can  be  used  for  processing 
the  conjunctions.  PYorn  these  indexed  disjunctions  a  suitable  subset  S  must  be 
chosen,  and  the  cost  of  evaluating  the  query  using  indices  must  be  compared 
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with  the  cost  of  evaluating  the  query  using  a  sequential  scan.  In  the  case  that 
there  are  not  enough  buffers  in  main  memory  for  keeping  the  index  set,  the  cost 
of  storing  intermediate  results  must  also  be  taken  into  account. 

In  conjunctive  normal  form  queries  the  placement  of  the  records  in  the 
blocks  of  the  file  is  an  important  factor  for  the  choice  of  an  optimal  strategy. 
Consider  the  query  A  f^i(B  [  J C ),  where  A,  B,  and  C  arc  attribute  conditions  on 
the  attributes  A,  B  and  C,  each  having  a  selectivity  of  .3.  To  use  all  three  indices 
for  evaluating  the  query  is  expensive  since  it  requires  3*153=459  blocks  of  index 
(in  addition  to  the  sorting  cost  if  an}'’)  and  a  large  portion  of  the  blocks  of  the  file 
also  contains  qualifying  records  (1500  blocks  approximately,  assuming  indepen¬ 
dence  and  random  placement).  On  the  other  hand  for  retrieving  the  index  on  A 
only,  153  block  accesses  are  required.  For  various  degrees  of  clustering  of  the 
records  qualifying  in  the  attribute  condition  on  A,  the  cost  vdll  be  between  753 
and  2153  blocks  in  comparison  to  the  2000  blocks  of  the  sequential  scan.  Thus 
in  some  cases  large  savings  may  result  by  using  the  index  on  A  to  process  the 
query  instead  of  a  sequential  scan  of  the  file. 

Disjunctive  normal  form  queries 

Disjunctive  normal  form  queries  may  have  to  be  transformed  into  a  con¬ 
junctive  normal  form  before  further  optimization  (if  possible).  Liu  [Liu  1976] 
has  examined  the  problem  of  transforming  a  query  into  an  optimum  query  tree, 
but  her  cost  measures  are  appropriate  for  one  level  storage  only.  We  assume 
that  whenever  a  term  in  the  conjunction  can  be  factored  out,  the  query  is  al^ways 
transformed.  For  example  the  query  (A  )  ij(d  p,C' )  is  always  transformed 
into  the  query /I  p(i^  Before  other  cost  considerations.  The  reason  is  that 

the  number  of  alternative  strategies  to  be  examined  is  reduced  and  that  cluster¬ 
ings  of  certain  attribut.e  values  can  be  realized  by  the  optimization  algorithm.  If 
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there  are  several  transformations  possible,  all  of  them  have  to  be  examined  to 
decide  about  an  optimal  strategy.  We  do  not  consider  here  transformations  that 
may  lead  to  a  mixed  form  (not  pure  conjunctive  or  pure  disjunctive  normal 
form),  but  query  processing  of  these  forms  could  be  similar  to  that  for  the  dis¬ 
junctive  normal  form.  In  order  to  use  indices  to  evaluate  a  query  in  disjunctive 
normal  form,  each  conjunctive  condition  of  the  qijery  must  have  at  least  one 
attribute  indexed.  There  are  three  steps  involved  in  the  evaluation: 

1.  For  each  conjunction  a  set  of  indices  to  evaluate  the  conjunction  is  found 
as  described  before.  If  in  one  conjunction  the  sequential  scan  is  the 
optimal  strategy,  then  the  sequential  scan  is  used  for  processing  the  query. 

(d.  These  pointers  are  processed  in  the  disjunctions  with  the  set  of  pointers 
selected  to  process  the  other  conjunctions. 

3.  The  records  pointed  to  by  the  set  of  pointers  is  retrieved  and  the 
remaining  conjunctive  conditions  are  evaluated. 

Steps  1  and  3  involve  block  accesses  and  thus  cost  for  the  system.  In  dis¬ 
junctive  normal  form  queries  the  distribution  of  attribute  values  for  each  attri¬ 
bute,  as  well  as  correlations  among  attribute  values,  and  data  placement  on  dev¬ 
ices  affect  the  choice  of  the  optimal  strategy. 

Simple  numeric  conditions 

Sequential  scan  of  the  file  to  evaluate  simple  numeric  conditions  may  be 
avoided  if  at  least  one  of  the  participating  attributes  has  an  index.  Consider  the 
simple  numeric  condition  .4  +  B  =  c.  If  we  know  the  range  of  values  of  the  attri¬ 
butes  A  and  B,  then  we  can  find  a  sub-domain  of  A  that  possibly  satisfies  the  con¬ 
dition.  If  the  domain  of  values  of  the  attribute  A  is  the  set  of  integers  from  1  to 
10,  the  domain  of  values  of  B  is  the  same,  and  the  constant  c  is  20,  then  only  the 
records  of  the  file  with  the  value  10  for  their  attributes  A  or  B  may  satisfy  the 
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condition.  If  c  is  15.  then  the  possible  values  of  A  that  satisfy  the  condition  arc 
from  5  to  10.  Therefore  if  A  is  indexed  we  can  use  the  index  to  retrieve  the 
records  of  the  file  with  attribute  A  values  frurii  5  to  10  and  evaluate  the  condition 
on  these  records  in  order  to  isolate  the  qualifying  ones.  Alternatively,  wc  could 
use  both  indices  on  A  and  B  (if  they  exist)  in  order  to  further  restrict  the 
number  of  qualifying  records.  An  optimal  strategy  will  choose  between  a 
sequential  scan,  using  one  index  only,  or  using  both  indices. 

According  to  the  previous  description,  simple  numeric  conditions  can  be 
evaluated  as  a  set  of  conjunctive  conditions  on  the  participating  attributes. 
Thus,  they  are  very  sensitive  to  the  correlations  of  attribute  values.  Moreover, 
the  placement  of  the  qualifying  records  among  the  blocks  of  the  file  is  also 
important  for  the  selection  of  an  optimal  strategy.  As  an  example  consider  the 
selection  A  >  B  where  A  has  4  distinct  attribute  values  (1. 2.3.4)  and  B  has  8  (1,2, 
...  ,8).  The  possible  combinations  of  attribute  values  that  qualify  are  (2,1),  (3,1), 
(3,2),  (4,1),  (4,2),  (4,3).  Assuming  independence  of  attribute  values  the  number 
of  records  that  qualify  is  (6/32)*100000  =  18750.  The  index  cost  will  be 
100000/200  +  3  +  (3/8)*(100000/200)  +  3  =  693  blocks.  Therefore,  using  a 
model  that  assumes  independence  and  random  placement,  the  sequential  scan 
is  the  optimal  strategy.  However,  if  the  attribute  values  of  the  attributes  A  and 
B  are  highly  correlated  and/or  the  attribute  values  of  A  or  B  are  clustered  in  a 
small  number  of  blocks  in  secondary  storage,  the  use  of  indices  may  be  lass 
costly. 

Figure  8.1.3  shows  the  cost  of  a  selection  A  =-B,  of  two  attributes  A  and  B,  as 
a  function  of  the  correlation  of  the  attribute  values  of  A  and  B,  assuming  a  ran¬ 
dom  placement  of  the  qualifying  records  in  the  secondary  storage.  The  cost  of 
the  sequential  scan  is  also  indicated.  The  optimal  strategy  is  a  function  of  the 
correlation  of  the  attribute  values. 
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Figure  8.1.3 

Cost  of  a  selection  as  a  function  of  correlations 

As  an  example  of  a  more  complicated  query  consider  the  query 
{D  =  d)f){E  =  F)  which  is  a  query  in  a  conjunctive  normal  form  according  to  our 

1 

definition.  Assuming  that  D=''d"  has  a  selectivity  and  E  and  F  involve  50  dis- 

tincL  values  each,  the  cost  of  using  all  indices  may  be  less  than  the  cost  of  a 
sequential  scan.  The  cost  of  retrieving  the  index  on  D  ="d'’  is  53  block  accesses, 

while  for  evaluating  the  condition  E  =  F  is  (  +  3)  *  2  =  1006  block 

accesses.  The  total  number  of  qualifying  records  assuming  independence  is 

—  *  *  100000  =  400  records.  In  this  case  the  use  of  all  the  indices  for 

5  50 

evaluating  the  query  may  save  540  block  accesses.  However,  if  E  and  F  have 
correlated  attribute  values,  the  number  of  records  that  qualify  in  the  query  may 
be  very  large  and  cause  selection  of  records  from  virtually  all  the  blocks  of  the 
file.  If  the  use  of  indices  has  been  chosen,  the  total  cost  may  reach  2000  +1059 
block  accesses,  an  extra  cost  of  1059  block  accesses  in  comparison  to  the  cost 
of  a  sequential  scan  of  the  file. 
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Processing  queries  in  faster  sites 

In  this  section  we  examine  the  problem  of  query  processing  in  a  resLricLed 
distributed  environment.  In  this  environment,  a  copy  of  the  same  file  exists  in 
both  a  central  site  with  fast  disks,  as  well  as  in  a  satellite  site. 

One  way  to  process  a  query  in  this  environment  is  to  evaluate  the  query  in 
the  central  site  and  then  transmit  the  result  to  the  satellite  site.  Since  the  cen¬ 
tral  site  has  fast  disks,  the  result  will  be  produced  faster.  However,  the  cost  of 
establishing  communication  between  the  tv.’-o  sites  as  well  as  the  eost  of  transfer- 
ing  the  result  through  the  teleeommunication  lines  has  to  be  added  to  the  cost. 
The  total  cost  of  processing  the  query  in  the  remote  site  has  to  be  compared 
with  the  total  cost  of  processing  the  query  in  the  local  site  in  order  to  choose  an 
optimal  strategy.  The  cost  of  transmitting  the  result  is  given  by  Cq  +  C  ^*P*RS^ , 
where  Cq  is  the  cost  of  establishing  communication  between  sites,  C i  is  the  cost 
of  transfering  one  record  from  one  site  to  another,  and  P  is  the  projectivity  of 
the  query.  Local  processing  is  chosen  if 

di*{BS^  +  Y^CP)  ^  dr*{BS^  +  X;CP)  +  2*Cq  +  CpP*RS^‘ 

3  3 

where  di  and  dr  are  the  block  access  times  in  the  local  and  the  remote  site 
respectively.  The  profitability  of  processing  in  the  remote  site  depends  heavily 
on  the  record  selectivity  of  the  query,  and  thus  on  the  distribution  of  attribute 
values  for  each  attribute  as  well  as  on  dependencies  among  attribute  values  of 
different  attributes. 

Figures  8.1.4  and  6.1.5  show  the  effect  of  data  correlations  on  the  choice  of 
a  site.  For  very  small  record  selectivities,  the  cost  of  establishing  communica¬ 
tion  between  sites  makes  the  local  processing  attr  active.  This  is  shown  in  figure 
8.1.4  where  a  restriction  is  considered.  For  intermediate  record  selectivities 
the  remote  site  is  attractive  because  of  its  faster  devices.  For  large  record 
selectivities  the  local  site  becomes  more  profitable  because  the 


-  123  - 


telecommunication  cost  dominates.  This  is  shown  in  figure  8.1.5  where  a  selec¬ 
tion  {A  =  5  )  is  considered. 


F’igure  3.1.4 

Cost  of  local  and  remote  processing  of  a  restriction  as  function  of  correlations 


Figure  8.1.5 

Cost  of  local  and  remote  processing  of  a  selection  as  funcLiori  of  correlations 
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Multirelational  environmenLs 

In  this  section  we  discuss  some  implications  of  data  dependencies  on  the 
cost  of  multirelational  queries  and  we  show  that  the  choice  of  an  optimal  stra¬ 
tegy  is  sensitive  to  these  dependencies.  Detailed  treatment  of  query  evaluation 
in  multirelational  environments  using  models  of  uniform  and  independent  attri¬ 
bute  values  appeax's  in  a  number  of  publications  andwill  not  be  repeated  here. 
Relevant  references  are  [Yao  and  De  Jong  1979],  [Selinger  et  al.  1979],  [Kersch- 
berg  et  al.  1980]. 

To  perform  a  parallel  join  of  two  relations,  records  or  pointers  to  qualifying 
records  of  one  relation  have  to  be  processed  in  parallel  with  qualifying  records 
or  pointers  from  the  other  relation.  By  "in  parallel",  we  mean  in  increasing  (or 
decreasing)  order  of  attribute  values  of  the  joining  attribute.  In  a  parallel  join 
three  possible  strategics  for  accessing  the  records  of  one  of  the  relations  are 
the  following: 

1.  Retrieve  the  index  on  the  joining  attribute  if  there  is  one,  intersect  it 
with  the  joining  attribute  values  of  the  second  relation  such  that  values  that 
do  not  appear  in  the  join  are  eliminated,  and  use  the  reduced  joining  index 
to  retrieve  the  records  of  the  file  that  may  participate  in  the  join.  Examine 
the  retrieved  records  if  they  satisfy  the  restrictions,  and  eliminate  those 
that  do  not  satisfy  them.  The  remiaining  records  are  processed  in  the  join. 
The  cost  of  this  strategy  is  the  cost  of  retrieving  the  join  index  plus  the  cost 
of  retrieving  the  records  from  the  secondary  storage. 

2.  Retrieve  the  join  index,  retrieve  the  restriction  indices,  intersect  them 
to  find  pointers  of  the  join  index  that  satisfy  the  restrictions,  process  in 
parallel  the  remaining  values  of  the  join  index  with  the  values  of  the  join 
attribute  of  the  second  file  such  that  values  that  are  not  common  in  the  two 
joining  domains  are  eliminated,  and  use  the  remaining  pointers  to  retrieve 
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the  records  of  the  file  that  participate  in  the  join.  The  cost  of  this  strategy 
is  the  sum  of  the  costs  of  retrieving  the  join  index,  the  restriction  index, 
and  the  records  participating  in  the  join. 

3.  Retrieve  the  restriction  index,  use  it  to  retrieve  the  records  of  the  file 
qualifying  in  the  restrictions,  sort  these  records  according  to  the  values  of 
the  join  domain  and  process  them  in  the  join.  The  cost  of  this  strategy  is 
the  sum  of  the  costs  of  retrieving  the  restriction  indices  and  the  records 
qualifying  in  the  restriction,  plus  the  cost  of  externally  sorting  the  retrieved 
records. 

Figures  8.1.6  and  8.1.7  present  the  cost  of  these  three  strategies  as  a  func¬ 
tion  of  the  correlation  of  the  restriction  attributes.  As  can  be  seen,  the  optimal 
strategy  depends  on  the  cori  elation  of  the  alLribute  values  and  the  difference  in 
the  cost  between  the  optimal  strategy  and  a  non-optimum  strategy  can  be  con¬ 
siderable. 


Figure  8.1.6 

Cost  of  three  strategies  for  accessing  one  file  for  query  A 


The  size  of  the  join  was  defined  as  the  number  of  records  that  the  join  of 
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two  attributes  has. 


Figure  8.1.7 

Cost  of  three  strategies  for  accessing  one  file  for  query  B 

The  size  of  the  join  is  a  very  important  factor  in  choosing  optimum  query  pro¬ 
cessing  strategies.  The  reason  is  that  the  size  of  the  join  can  be  ver)?'  large  in 
comparison  to  the  sizes  of  the  original  relations.  Two  cases  where  the  size  of  the 
join  can  be  a  very  important  factor  in  choosing  a  strategy  are  in  joins  involving 
more  than  two  relations  [Astrahan  et  al.  1979]  and.  in  distributed  environments 
[Kerschberg  et  al.  1980].  When  the  joins  of  more  than  two  relations  are  involved, 
it  is  desirable  to  keep  the  size  of  intermediate  relations  small.  Large  intermedi¬ 
ate  relations  imply  large  10  costs  for  storing  (and  possibly  sorting)  intermediate 
results,  as  well  as  many  more  comparisons.  Thus,  it  is  important  that  join 
sequences  that  create  large  intermediate  relations  be  avoided.  The  size  of  the 
join  is  very  important  even  for  two  relation  joins  in  a  distributed  environment. 
The  reason  is  that  the  resulting  relation  may  be  formed  in  a  remote  site  and 
then  be  transferred  into  the  local  site.  Since  the  telecommunication  cost 
depends  linearly  on  the  number  of  records  transferred  betiveen  sites,  the 
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optimal  strategy  is  sensitive  to  the  size  of  the  join. 

The  size  of  the  join  depends  on  the  distribution  of  the  attribute  values  in  the 
joining  domain  of  the  two  relations.  Moreover,  when  restrictions  or  selections  on 
values  of  certain  attributes  precede  the  join,  the  size  of  the  join  depends  on  the 
correlation  of  the  attribute  values  of  these  attributes  with  the  join  attribute. 
Figures  8.1.8  and  8.1.9  show  the  size  of  the  join  as  a  function  of  the  correlations 
of  the  attribute  values  of  the  restriction  or  selection  attribute  with  the  joining 
attribute.  Both  figures  show  strong  dependence  of  the  size  of  the  join  on  the 
correlation  of  the  attribute  values. 
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Figure  8.1.8 

Size  of  the  join  for  query  C 


In  conclusion,  an  optimal  strategy  should  take  into  account  dependencies 
among  various  attributes.  If  these  dependencies  are  omitted  in  the  estimation 
of  selectivities,  the  strategy  choosen  by  the  optimizer  may  be  considerably 
more  expensive  than  an  optimal  one. 


-  128  - 


Figure  8.1.9 

Size  of  the  join  for  quer}'"  D 

8.2  Index  Selection 

In  this  section  we  examine  the  problem  of  index  selection  and  give  exam¬ 
ples  of  performance  improvement  by  selecting  a  set  of  indices  based  on  the 
improved  approximations.  We  also  discuss  index  selection  in  clustered  files,  and 
clustering  attribute  selection. 

Index  selection 

The  problem  of  index  selection  has  been  examined  by  various  researchers 
and  algorithms  have  been  proposed  for  selecting  an  optimal  or  near  optimal  set 
([Hammer  and  Chan  1976],  [Anderson  and  Berra  1978]).  Basically,  these  algo¬ 
rithms  incorporate  a  new  attribute  in  a  candidate  indexing  set  only  if  it  reduces 
the  overall  system  cost.  We  will  not  reproduce  these  algorithms  here.  A  survey 
can  be  found  in  [Batory  1981]. 

The  selection  of  an  optimum  indexing  set  depends  on  the  query  optimizer 
used.  For  a  sophisticated  optimizer,  such  as  the  one  presented  in  the  previous 
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section,  for  a  candidate  indexing  set  D,  the  cost  of  the  optimal  query  processing 
strategy  Opt(Q)  for  the  query  Q  can  be  found  using  the  statistics  kept  by  the 
optimizer.  Therefore,  for  a  candidate  indexing  set  D,  the  total  system  cost 
T{0^t{Q))  for  all  the  queries  in  a  time  interval  can  be  estimated.  The  iteration 

however,  of  the  optimization  procedure  for  all  queries  Q  for  the  candidate  index¬ 
ing  sets  Di  may  be  expensive.  Large  reductions  in  the  time  needed  for  selecting 


an  optimal  indexing  set  will  occur 


each  iteration  examining  only  these  queries  affected  by  the  change  of  the  index¬ 
ing  set  Di.  For  a  less  sophisticated  optimizer  that  always  uses  all  the  available 
indices  in  order  to  evaluate  a  query,  the  queries  can  be  separated  into  classes 
according  to  the  attributes  participating  in  the  query  and  the  type  of  the  query 
conjunctive  or  disjunctive.  Given  the  frequency  of  each  class  and  the  distribu¬ 
tion  of  attribute  values  in  queries,  the  average  system  cost  can  be  estimated 
using  average  selectivities  for  a  given  set  Di.  This  estimation  can  be  much  more 
inexpensive  than  the  one  described  before  for  sophisticated  optimizers. 

Intuitively,  an  attribute  is  a  good  candidate  for  indexing  if  it  is  often 
referred  to  in  the  user  queries,  if  the  existence  of  the  index  reduces  consider¬ 
ably  the  average  cost  of  answering  queries,  and  if  the  attribute  values  of  this 
attribute  are  not  often  modified  by  the  users.  The  selection  of  an  attribute  for 
indexing  depends  also  on  the  other  attributes  that  have  already  been  selected 
for  indexing  and  on  the  attributes  that  appear  in  queries  with  it.  For  example,  if 
the  attribute  A  has  already  been  selected  for  indexing,  and  the  attributes  A  and 
B  appear  often  together  in  conjunctive  queries,  then  the  profitability  of  an  index 
on  B  decreases  because  these  queries  can  be  answered  by  using  the  index  on  A 
to  retrieve  the  portion  of  the  records  of  the  file  that  qualify  in  the  attribute  con¬ 


dition  on  A.  and  then  evaluate  the  conjunctive  condition  on  the  retrieved 
records.  Thus  dependencies  of  the  attribute  values  in  the  file  as  well  as  in  the 
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user  queries  affect  the  choice  of  attributes  for  indexing.  Moreover,  the  distribu¬ 
tion  of  the  attribute  values  in  the  file,  as  well  as  the  placement  of  records  in  the 
file,  also  affect  the  choice  of  an  attribute  for  indexing  because  they  affect  the 
average  retrieval  cost. 

Next  we  consider  examples  to  illustrate  the  differences  between  our  model 
and  previous  approaches.  Consider  an  environment  where  only  single  attribute 
queries  are  asked  by  the  users.  Figure  8.2.1  shows  the  average  cost,  in  block 
accesses  as  a  function  of  the  standard  deviation  of  the  attribute  values  of  an 
attribute  which  has  attribute  values  following  a  Normal  distribution.  We  also 
assume  uniform  queries  and  random  placement  of  the  qualifying  records  among 
the  blocks  of  the  file.  A  given  maintainance  cost  divides  the  axis  into  two  sec¬ 
tions,  one  allowing  the  creation  of  an  index  on  this  attribute  and  another  where 
the  creation  of  an  index  is  not  profitable. 


Figure  8.2. 1 

Average  cost  of  single  attribute  queries 


As  a  second  example  consider  an  environment  where  only  conjunctive 
queries  exist.  Assume  that  a  candidate  indexing  set  I)  contains  the  attribute  A 
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and  that  the  attribute  B  appears  always  with  the  attribute  A  in  conjunctions. 
The  cost  of  conjunctive  queries  on  the  attributes  A  and  B  is  shown  in  Figure  8.2.3 
as  a  function  of  the  correlation  of  the  attribute  values  of  A  and  B  (assuming  that 
the  attribute  values  in  the  conjunctive  queries  are  uniformly  distributed  over  all 
the  combinations  of  the  attribute  values,  and  that  the  qualifying  records  are 
randomly  placed  among  the  blocks  of  the  file).  Again,  the  rnaintainance  cost  of 
the  attribute  B  separates  the  figure  in  two  sections,  one  where  the  attribute  B  is 
profitable  for  indexing  and  one  where  it  is  not.  Previous  approaches  to  estimat¬ 
ing  the  system  cost  are  not  capable  of  taking  into  account  these  differences  due 
to  correlation  of  attribute  values. 


Average 

cost 


Figure  8.2.2 

Index  selection  in  conjunctive  query  environment 


Index  selection  in  clustered  files 

The  average  system  cost  for  a  set  of  queries  posed  in  a  time  interval 
depends  on  the  placement  of  the  records  in  a  hie.  Thus,  an  indexing  set  that 
minimizes  the  expected  system  cost  for  a  non-clustered  file  may  not  minimize 
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the  expected  system  cost  if  the  records  of  the  same  file  were  clustered  accord¬ 
ing  to  the  values  of  one  of  its  attributes. 

Consider  an  environment  vfith  queries  on  single  attributes  only.  Figure 
8.2.3  shows  the  cost  of  queries  on  an  attribute  A  when  the  file  is  not  cluster  ed 
and  when  the  file  is  clustered  on  the  values  of  an  attribute  B  (as  a  function  of  the 
correlation  of  the  attribute  values  of  the  attributes  A  and  D).  The  queries  are 
assumed  to  be  uniforiiily  distributed  over  the  attribute  values  of  A.  The  mainte¬ 
nance  cost  for  the  index  on  A  divides  the  diagram  in  two  sections.  One  for  small 
correlations  of  the  attribute  values  of  the  attributes  A  and  B.  where  an  index  on 
A  is  not  profitable,  and  another  for  high  correlations,  where  an  index  on  A  is 
profitable.  For  the  non-clustered  file,  an  index  on  attribute  A  is  never  profitable. 
The  selection  of  attribute  B  for  indexing  is  only  possible  with  the  improved 
approximations. 


Average 

cost 


cost 


Figure  B.2.3 

Index  selection  in  a  clustered  file  with  single  attribute  queries 
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Clustering  attribute  selectiori 

Consider  an  environment  with  frequent  requests  for  processing  of  records 
of  a  file  in  the  order  of  the  attribute  values  of  various  attributes.  For  example, 
in  an  employee  file  queries  may  often  request  that  the  records  of  the  employees 
of  one  or  more  departments  be  processed  according  to  department  number 
(D^^),  while  other  queries  may  often  require  that  the  records  of  the  employees  of 
one  or  more  projects  be  processed  according  to  the  project  number  (P#).  The 
first  type  of  query  is  facilitated  if  the  file  is  clustered  on  D#,  while  the  second  is 
facilitated  if  the  file  is  clustered  on  ¥§.  Another  reason  for  these  two  clusterings 
is  that  frequent  joins  between  the  relation  of  employees  with  the  relations  of 
departments  or  projects  may  be  required.  Selecting  among  two  or  more  candi¬ 
date  attributes  for  clustering  requires  the  use  of  an  automatic  method,  since 
the  optimal  set  of  indices  for  each  clustering  has  to  be  found,  and  the  estimation 
and  comparison  of  the  overall  system  cost  for  each  candidate.  A  procedure  for 
clustering  attribute  selection  is  given  below: 

1.  Select  one  candidate  attribute  for  clustering. 

2.  Select  an  optimal  indexing  set  for  this  clustering  and  estimate  the 

overall  system  cost. 

3.  Repeat  steps  1  and  2  for  all  clustering  attribute  candidates.  Select  for 

clustering  the  attribute  with  the  least  average  cost. 

In  the  overall  system  cost,  the  cost  of  reorganizations  has  to  be  added, 
although  we  ignore  it  here. 


B.3  Performance  Prediction 

Performance  measures  appropriate  for  data  base  performance  predictors 
include  the  number  of  blocks  accessed,  device  utilizations,  and  response  times. 
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Data  base  system  performance  predictors  may  be  used  for  comparing  alterna¬ 
tive  design  decisions  or  for  examining  the  impact  of  a  design  decision  or  a 
change  in  the  data  base  workload  on  response  times  and  device  utilizations. 

Certain  design  decisions,  however,  cannot  be  confidentiy  compared  ’when 
the  assumptions  of  uniformity,  independence,  and  random  placement  are  used. 
Examples  of  the  errors  introduced  by  these  assumptions  were  shown  in  the  pre¬ 
vious  sections  of  this  chapter.  The  error  in  the  estimation  of  the  expected  sys¬ 
tem  cost  is  sometimes  very  significant,  so  that  differences  in  performance  given 
as  output  of  the  data  base  performance  predictor  may  not  have  any  meaning.  In 
other  cases,  no  difference  in  performance  between  the  two  alternatives  may  be 
predicted,  while  actually  there  is  a  difference  (for  example,  index  selection  in 
clustered  and  non-clustered  files). 

When  response  time  problems  motivate  changes  in  the  data  base,  a  good 
prediction  in  the  response  time  is  required.  Otherwise  the  cost  of  the  change 
ma}?’  not  be  justified.  In  such  cases,  more  detailed  modelling  of  data  base  con¬ 
tents,  data  placement  on  devices  and  user  requests  may  avoid  the  large  errors 
in  the  estimation  of  the  response  time  of  the  system  by  the  data  base  perfor¬ 
mance  predictor. 


8.4  Summary 

In  this  chapter  we  have  discussed  applications  of  the  improved  approxima¬ 
tions  of  selectivities  in  query  evaluation,  index  selection,  and  performance  pred¬ 
iction.  We  have  presented  examples  to  demonstrate  that  considerable  savings 
may  occur  by  using  the  improved  approximations  of  selectivities. 

In  the  query  evaluation  section,  we  presented  a  set  of  algorithms  that  take 
advantage  of  the  improved  approximations  of  selectivities  to  reduce  the  cost  of 
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evaluating  the  user  queries  for  a  wide  class  of  queries.  In  the  data  base  design 
section,  solutions  to  the  problems  of  index  selection,  index  selection  in 
clustered  files,  and  clustering  attribute  selection  were  discussed.  The  latter  two 
problems  can  only  be  examined  by  using  the  improved  approximations  of  selec- 
tivities.  Moreover,  the  improved  approximations  of  selectivities  can  be  useful  in 
a  variety  of  other  data  base  performance  studies.  Examples  are  the  attribute 
partitioning  problem  [Hammer  and  Niamir  1979],  the  allocation  of  copies  in  vari¬ 
ous  sites  [Casey  1972],  and  multi-attribute  retrieval  file  organizations  [Liou  and 
Yao  1977]. 

Most  previous  researchers  have  not  validated  the  models  that  they  have 
proposed  for  solving  the  data  base  problems  described.  Others  have  validated 
their  models  using  input  data  that  satisfy  the  assumptions  of  the  model.  (For 
example,  the  performance  of  the  relational  optimizer  in  System  R  has  been 
examined  by  using  data  that  satisfy  the  assumptions  of  uniformity,  indepen¬ 
dence,  and  random  placement  [Astrahan  et  al.  1980].)  The  degrees  of  error  in 
the  estimated  performance  by  using  the  random  placement,  uniformity,  and 
independence  assumptions  presented  in  this  chapter  contribute  to  our 
knowledge  of  the  limitations  of  the  previous  models. 
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Chapter  9:  Conclusions  and  Future  Research 


In  this  thesis  we  have  examined  closely  various  assumptions  made  in  model¬ 
ling  data  base  system  performance.  In  particular,  we  have  examined  the 
assumptions  made  often  in  existing  models  and  the  feasibility  for  improvement 
in  modelling  data  base  contents,  data  placement  on  devices,  and  user  queries. 

The  work  presented  has  both  practical  and  academic  interests.  From  the 
academic  point  of  view  we  have  shown  that  certain  functions  used  in  data  base 
performance  evaluation  have  the  Schur  concave  property.  We  have  thus  associ¬ 
ated  the  theory  of  majorization  which  plays  an  important  role  in  the  theory  of 
mathematical  inequalities  with  these  functions.  Well-known  and  understood 
mathematical  quantities  like  entropy,  standard  deviation,  correlation  and  kur- 
tosis  were  also  related  to  the  performance  measures. 

Our  approach  for  estimating  record  selectivities  generalizes  the  previous 
approaches  that  are  based  on  uniformity  and  independence  assumptions.  Our 
approach  for  estimating  block  selectivities  generalizes  and  associates  existing 
approaches  by  Yao  and  Cardenas,  which  are  based  on  a  random  placement 
assumption. 

From  a  practical  point  of  view,  we  have  shown,  using  actual  data,  that  the 
assumptions  used  in  previous  models  are  not  satisfied  in  some  common,  large 
data  base  environments.  We  have  also  shown  that  in  certain  cases,  large  errors 
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in  estimating  performance  may  result.  Degrees  of  error  in  estimated  perfor¬ 
mance  presented  in  this  thesis  contribute  to  our  knowledge  of  the  limitation  of 
the  solutions  given  to  some  data  base  performance  evaluation  problems  in  prior 
research. 

Some  new  consequences  of  the  theory  developed  here  can  be  useful  in  data 
base  design.  One  consequence  is  that  it  is  more  profitable  to  index  on  attributes 
with  non-uniform  attribute  values,  than  on  those  with  uniform  ones.  Another  is 
that,  in  clustered  files,  the  profitability  of  an  attribute  A  for  indexing  increases 
as  the  absolute  value  of  the  correlations  betvv'^een  the  attribute  A  and  the  clus¬ 
tering  attribute  increases.  Thus,  exploitation  of  non-uniformity  and  correlations 
of  attribute  values  in  a  file  and  in  the  user  queries  could  reduce  the  overall  sys¬ 
tem  cost.  With  respect  to  improving  the  existing  modelling  techniques  we  have 
shown  that  our  techniques  can  provide  substantially  better  approximations  of 
record  and  block  selectivities.  Our  approach  in  estimating  block  selectivities  is 
also  useful  in  the  important  case  where  the  file  is  clustered  on  the  attribute 
values  of  one  of  its  attributes.  The  actual  and  the  simulated  experimental 
results  presented  in  the  thesis  point  out  the  applications  and  the  environments 
where  the  better  approximations  of  selectivities  are  useful. 

Emphasis  has  been  given  in  this  thesis  to  the  fact  that  in  data  base  perfor¬ 
mance  evaluation  studies,  careless  modelling  of  data  base  contents  and  data 
placement  on  devices  may  lead  to  serious  errors  in  the  estimated  performance. 
(This  is  in  contrast  to  systems  wdthout  data  base  management  component  where 
often  simple  modelling  of  data  is  adequate  [Seveik  1978J.)  The  multivariate  sta¬ 
tistical  approach  proposed  for  modelling  data  base  contents,  data  placement  on 
devices,  and  user  requests,  can  lead  to  an  improved  model.  To  use  the  improved 
approximations  of  selectivities  is  more  costly  than  to  use  the  uniformity  and 
independence  assumptions.  However,  the  increase  in  cost  is  neglibie  relative  to 
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Data  Base  operating  budgets.  For  a  potential  user  of  the  techniques  proposed  in 
this  thesis  the  chapter  on  the  applications  of  the  improved  approximations 
(Chapter  8)  can  be  useful  for  deciding  about  using  these  techniques.  Careful 
analysis  of  the  specific  environment  may  indicate  that  large  savings  may  occur 
by  using  the  improved  approximations  of  selectivities.  More  insight  on  the 
effects  of  non-uniformity,  dependencies,  non-random  placement  and  distribution 
of  attribute  values  in  queries  is  given  in  Chapter  5.  Finally  Chapters  6  and  7 
describe  the  improved  approximations  of  selectivities. 

It  is  a  future  research  project  to  examine  the  utility  of  these  techniques  in 
actual  data  base  environments,  as  well  as  to  classify  the  categories  of  em.dron- 
ments  where  these  techniques  are  particularly  useful  (for  example  as  in 
[Hawthorn  19B1]). 

Many  problems  in  data  base  performance  have  been  examined  in  isolation. 
The  relative  significance  of  the  errors  in  modelling  is  not  understood.  As  an 
example  the  use  of  a  buffer  has  been  examined  in  isolation.  However,  non- 
uniform  distribution  of  attribute  values  in  the  user  queries  could  significantly 
increase  the  profitability  of  using  a  buffer.  Moreover,  the  distribution  of  attri¬ 
bute  values  in  a  file  and  in  the  user  queries  -will  affect  the  optimial  size  of  the 
buffer  for  a  particular  data  base.  Locking  is  also  affected  by  the  distribution  of 
attribute  values  in  user  queries.  A  unified  examination  of  these  problems  could 
increase  our  understanding  of  data  base  system  performance. 

Finally  the  development  of  good  physical  data  base  performance  predictors 
is  an  important  research  project. 
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Appendix  1:  Parameter  Estimation  and  Update 


In  this  section  we  describe  how  to  make  a  best  fit  of  Pearson  curves  to  the 
data  in  a  file,  and  how  parameters  of  the  distributions  can  be  estimated.  The 
kurtosis  (3  is  defined  as 


Xi 

where  Xq  is  the  qr’th  central  moment  [Elderton  1953].  Unimodal  Type  2  distribu¬ 
tions  have  a  (3  ranging  from  1.8  to  3.0,  corresponding  respectively  to  the  uniform 
and  the  Normal  distributions.  Unimodal  Type  7  distributions  have  a  /S  ranging 
from  3.0  (for  normal),  to  infinity.  Therefore,  the  Type  of  the  distribution  is 
determined  by  the  value  of  p.  The  parameter  k  in  one  dimension  can  be  deter¬ 
mined  from  the  relation: 


The  determination  of  the  parameter  m  for  a  multivariate  distribution  can  be 
done  in  n-space  using  the  radial  distribution  [Cooper  1963].  However,  [Cooper 
1964]  suggests  that  in  practice  we  could  determine  k  for  the  marginal  in  each 
coordinate  direction,  and  an  eirithmetic  average  of  these  could  be  used  for 
determining  the  marginal  k.  For  a  data  base  with  a  large  number  of  attributes, 
where  only  a  small  subset  of  them  is  used  in  the  qualification  part  of  each  query, 
the  latter  approach  seems  preferable,  and  we  have  used  it  in  our  calculations. 
The  parameters  of  the  distributions,  therefore,  can  be  determined  as  follows: 


Al.l 
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C  =  S  (2'-  in)(z"-  m)’  A1.2 

i=i 

^9  =  -^T,  I2'-  m|^  A1.3 

where  z’’  is  a  vector  formed  by  the  ordered  attribute  values  of  the  attributes  of 
the  record.  All  the  parameters  can  be  estimated  in  one  file  pass  by  using  an 
appropriate  expansion  of  the  formulae  (Al.l),  (Ai.2),  and  (A1.3).  For  Type  7  dis¬ 
tributions  with  low  values  of  k  (high  peaks),  the  qi  th  moment  may  not  exist.  In 
one  dimension  when  ^  5/2  the  kurtosis  cannot  be  used  for  determining  k. 
Cooper  [Cooper  1964]  suggests  two  alternative  methods  for  estimating  the 
parameters  for  low  values  of  k,  the  most  interesting  being  the  use  of  fractional 
moments,  which  is  a  bit  more  expensive  but  is  still  an  adaptive  technique.  We 
have  not  encountered  such  a  distribution  in  our  experiments  with  actual  files. 
However,  we  think  that  an  alternative  technique  that  examines  a  specific  attri¬ 
bute  value  as  a  separate  attribute  could  also  be  useful.  This  technique  would 
also  take  care  of  the  "all-other"  values.  In  some  cases,  one  attribute  value  is 
reserved  for  something  that  is  not  what  the  attribute  describes  (for  exam.ple 
"NO-EDUCATION"  for  the  attribute  education,  or  "NO-RELIGION"  for  the  attribute 
religion,  or  "TRADED"  for  the  attribute  price,  or  "UNKNOWN"  for  a  certain  attri¬ 
bute).  Such  attribute  values  may  present  high  peaks,  different  correlations  with 
the  rest  of  the  attributes  and  may  often  participate  in  queries  with  the 
qualification  "NOT".  In  this  case,  if  the  exception  value  is  treated  as  a  separate 
attribute  (with  attribute  values  one  for  the  exception  value  and  two  for  any 
other  value  say),  the  calculations  will  be  fasLei'  and  more  accurate. 

Care  should  be  taken  in  the  parameter  estimation  for  large  data  bases.  The 
estimation  of  the  parameters  bf  the  distributions  can  be  harnpexed  by  poten¬ 
tially  large  roundoff  errors.  Alternative  methods  for  estimating  the  parameters 
of  a  multivariate  Normal  distribution  so  that  potentially  large  errors  can  be 
avoided  are  given  in  [Neely  1966].  The  best  estimations  are  achieved  with  two 
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passes  of  the  data,  one  that  gives  approximate  values  of  the  parameters  (this 
could  be  a  smaller  sample),  and  a  second  where  the  actual  values  of  the  parame¬ 
ters  are  estimated.  In  the  second  pass,  only  differences  from  the  approximate 
values  of  the  parameters  are  used  in  the  calculations,  and  thus  large  roundoff 
errors  are  avoided.  The  potentially  large  roundoff  error  for  large  data  bases 
gives  another  reason  why  polynomial  approximations  are  not  so  attractive  in 
such  environments. 

The  speed  of  the  evaluation  of  the  probability  density  function  is  a  very 
important  consideration  for  our  environment.  An  advantage  of  using  a 
parametric  technique  is  that  efficient  methods  can  be  developed  for  the  fast 
evaluation  of  the  probability  density.  Sebestyen  and  Edie  ([Sebestyen  and  Edie 
1968])  described  the  successful  application  of  a  method  for  the  fast  estimation 
of  the  probability  density  in  multivariate  Normal  distributions,  in  a  speech 
recognition  environment.  The  method  precomputes  and  stores  piecewise 
approximations  of  the  normalized ;  density.  This  approach  is  not  possible  with 
general  polynomial  aproximations  of  the  density. 

As  was  mentioned  before,  the  method  is  adaptive,  and  thus  the  parameters 
of  the  distributions  can  be  updated  periodically  by  the  values  of  the  new  records 
inserted  in  the  data  base  without  the  need  to  go  through  all  the  data  again.  The 
formulae  for  the  adaptive  update  are  given  next. 

If  a  new  sample  (N  +  l)st  comes,  the  mean  vector  becomes: 

1  N  +  l 

“  ttVt m(iV)  +  xv+i) 

where  m(N  +  l)  is  the  estimate  obtained  with/Z  +  l  samples,  and  m(iV)  is  the  esti¬ 
mate  obtained  with  N  samples.  If  the  updates  are  done  in  batches,  the  above 


-  142  - 


formula  bcoomei; 

We  can  obtain  anedogous  results  for  the  covariance  matrix; 

C  (iV+ 1)=^  [(x— m)(x-m)  ] 

1^  +  1^ 

=  Yj  XyXy-m(iV  +  l)ni’(A^  +  i) 

=  -~(NC(N)+Xs*iT^^i)-m(N  +  l)m(N  +  l) 

This  expression  can  be  used  for  updating  the  covariance  matrix  with  the  (A’  +  l)st 
sample.  For  the  batch  update  of  the  covariance  matrix  we  can  obtain  the  for¬ 
mula: 


C{N+M)  =  j^iNC{N)+^XKM^S^)-m.{ff*M)m{N+if) 
For  the  fourth  moment,  X4,  in  the  ith  axis  we  have 


.  N-i-l 


J  =  1 


1  .  . 

—  -  /  \  ^  /  n*  ? - -TTH  /  .  t  - 'm  jr  ,  - 

"  Af  +  I  •'  ••-A*w  ; 


(A'A'i  (A')+(i;^*,  -ml-*,  r) 


A'  + 1 


For  the  batch  update  of  the  fourth  moment  we  obtain: 


X\ iN+M)=j^-{jVX\ (AT )+  E  (4 *i -mh^j )^) 
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Appendix  2:  Average  Record  Selectivities  for  Normal  Distributions 


In  this  section  we  will  show  that  the  average  class  record  selectivity  for  the 
case  of  multivariate  Normal  distributions  is  given  by 


Axi  •  •  • 


A2.1 


where  S  is  the  portion  of  the  covariance  matrix  that  corresponds  to  the  selected 
subspace,  E  is  the  covariance  matrix  in  the  query  class  space,  C  is  a  positive 
definite  matrix,  and  M  is  a  constant.  We  will  also  show  how  to  estimate  C  and  M. 

The  average  record  selectivity  for  a  multivariate  Normal  distribution  ia: 


yexp [(-l/'2)(x- m)  S  ^(x-m)  + 


Sav  =  N 


+  (-  l/2)(x-/x)'E"Hx-/r)]dx 


A2.2 


We  set 


(x-m)  S  *(x- m)  +  (x- /r)  E  ^(x- /ii)  =  (x- a)  C(x- a)  +  Af 


A2.3 


and  we  will  estimate  C,  a,  and  M.  Let  a  =  E”^  and  s  =  S“^  We  have: 

xCx- xCa-aCx+aCa+Af  = 

=  X  orx  -  X  cr/x  -  /x’ox  +  /la’ct/x  +  x’sx  -  x  sm  -  msx  +  m’sm 

We  now  equate  terms  of  equal  degree  with  respect  to  x.  From  the  second  degree 
equations  we  get: 


Cij  =  Gij  +  Sij  (n^  equations) 


A2.4 


From  the  first  degree  equations  we  get: 


(n  equations  for  i  =  l,...n) 

3  3  3 


Let 


3 


3 


The  ttj’s  can  be  computed  as  solution  of  the  system 


Ca  =  /S 


A2.5 
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From  the  terms  of  zero  degree  we  get: 


M  =  rn'em  +  iiay.  -  a  Ca  =  m  sm  +  iiayL  —  a  (b  +  o-)a 
=  m’sm  a  sa  +  —  aVa 

”  ^^^ijlTliTTlj  “  '^S'ijCLillj  +  ~  <Jij  dj  dp 

if  if  if  if 

Therefore  M  can  be  estimated  from 


^  -  a^aj))  A2.6 

if 


Formulae  (A2.4),  (A2.5)  and  (A2.6)  can  be  used  for  the  computation  of  C  ,  a  and 


M. 


We  will  show  now  that  C  is  positive  definite.  By  definition,  we  must  prove 
that,  for  every  x^O,  x'Cx  >  0.  We  will  use  a  theorem  of  the  matrix  algebra  that 
says  that  if  a  matrix  is  positive  definite,  its  inverse  is  also  positive  definite. 
Since  E  and  S  are  both  positive  definite,  the  E”^  and  S"^  are  positive  definite. 
Then  for  every  x»*0  we  have: 

x’E“^x  >  0,  or  ^CTijXiXj  >  0 

if 

and  xS”^x  >  0,  or  >  0 

ij 

where  and  are  the  (i,j)  elements  of  the  matrices  E“^  and  S“^  respectively. 
But  for  X  9ft  0  we  have: 

XCX=  J]^ijXiXf=Y,{(^ij  +  Sij)XiXf 

a  a 

=  Y,(TijXiXj  +  Yi^ijXiXj  >  0 

^  a 

Therefore  C  is  positive  definite. 

Finally  we  will  prove  that 

^  Axi  •  •  • 

Substituting  (A2.3)  to  (A2.2)  we  get 


S 


av 


Zbci  •  •  •  Ax^e 
(27TnEp'^'2|S| 


f  exp[-l/2((x- 


a)’C(x-  a))]dx 
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Since  C  is  symmetric  positive  definite  matrix^  there  is  a  non-singular  matrix  A, 
such  that  A'CA  =  I,  the  identity  matrix.  Let  x-a=Ay,  where  y  is  a  vector 

y  =  (i/i.ya.  •  •  •  >yn)'> 

Then 


(x-  a)'C(x-  a)  =  (Ay)'C(Ay) 

=yACAy 

The  Jacobian  of  the  transformation  xsa+AyisJaie  |A|,  where  |  A|  indicatwe  Um 

absolute  value  of  the  determinant  of  A.  With  the  above  transformation  the 
integral  becomes: 

Inter,  =  \k\J  •  ■  •  f  '  dy\ 

=  |A1/  .  ■  •  ■  dy, 

=  |Ain(V2n) 

i=l 

=  lAKaTTr-^ 

But  since  A'CA  =  I,  we  have: 


lA’IICllAl  =  lAl^lCl  =  1 

(because  for  any  matrix  [A* I  *  |A|.  and  |I|  =  1).  Then  |A| 
tutlng: 


1 


and  subsii' 


And  finally: 


Inter,  — 


^  (277)”^ 
Cjl/-2 


^av  ~ 


=  N- 


A2;i 


Axj^e 


-M/2. 


n/2 


(27rrisi^-^'2jsi*'^^  ici^/2 

Axi  •  •  • 


(27r) 


n  /2  I 


\/Z 


Si 


1/2  I 


1/2 


V 
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Appendix  3;  Bounds  of  Block  Selectivities  for  Uniform  Distributions 


In  this  section  we  show  that  when  the  probability  that  a  block  of  the  file 
contains  a  qualifying  record  is  uniform  over  all  the  blocks  of  the  file,  the  bounds 
on  the  expected  block  accesses  approximated  by  our  model  coincide  with  the 
exact  value  of  the  expected  block  accesses. 

For  uniform  probability  density,  the  probability  that  a  block  will  not  be 
selected  becomes: 


Pn,(t)  = 


l/M  . 

\-{k/N)  ’ 


N 


Therefore  the  number  of  blocks  selected  is: 


fl,  =  ji/{i-n( 

4-  =n 


which  is  Yao’s  result  [Yao  1977].  V 
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Appendix  4:  A  Less  Expensive  Formula  for  Block  Selectivities 


In  this  section  we  will  provide  a  computationally  less  expensive  formula  to 
approximate  block  selectivities. 

1 

Expanding  (1-P,(i))  near  P,(i)  s  0  for  ibz  <  1  we  obtain: 


\-kz 


Therefore 


1 


1  —  Jka 


)) 


t=l 


For  z  =  max(pg(a:_j))  this  formula  approximates  the  upper  bound  of  the  expected 


expected  value  of  the  block  selectivity. 
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